git@vger.kernel.org mailing list mirror (one of many)
* [RFC PATCH 0/4] Improvements to sha1_file
@ 2017-06-09 19:23 Jonathan Tan
  2017-06-09 19:23 ` [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
                   ` (31 more replies)
  0 siblings, 32 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-09 19:23 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

I was investigating how to adapt my existing patch for missing blob
support [1] to consult a manifest of missing blobs, and found it
difficult to modify sha1_file.c any further without doing some
refactoring first. So here are some patches to do that.

I think patch 1 is an independently good change - it makes the code
clearer and is also a net reduction in lines. If none of the other
patches here make it, maybe patch 1 should go in independently.

Patches 2-3 are also collectively independent, but more invasive. The
commit messages explain what's going on in more detail, but basically
there are 3 functions doing similar things (getting information for an
object regardless of where it's stored) with duplicated mechanisms, and
for maintainability, it is better to combine them into one function.

Patch 4 is my adaptation of [1] after all the refactoring - notice that
I needed to edit only 1 storage-agnostic object info function instead
of the 3 I would have had to edit before. It is still a work in
progress - the code looks complete, but I would probably need to at
least document the missing blob manifest format. I am providing it here
just to show the effectiveness of the refactoring in patches 2-3.

I am hoping for reviews on patches 1-3, with a view to getting them
included in the tree.

[1] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: extract type and size from object_info
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/cat-file.c       |  29 +--
 builtin/fsck.c           |   7 +
 builtin/pack-objects.c   |   5 +-
 cache.h                  |  12 +-
 sha1_file.c              | 484 +++++++++++++++++++++++++++++++----------------
 streaming.c              |   4 +-
 t/t3907-missing-blob.sh  |  69 +++++++
 8 files changed, 439 insertions(+), 181 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.508.gb3defc5cc-goog



* [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
@ 2017-06-09 19:23 ` Jonathan Tan
  2017-06-12 20:55   ` Junio C Hamano
  2017-06-09 19:23 ` [RFC PATCH 2/4] sha1_file: extract type and size from object_info Jonathan Tan
                   ` (30 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-09 19:23 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.
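
As a rough illustration of the idea (not the code in this patch), the
sketch below resolves a type once and hands it to whichever of the two
outputs the caller asked for. All names here (obj_type, info_request,
type_name_of, fill_type_info) are hypothetical stand-ins, and a plain
char buffer is assumed in place of a strbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are git's real definitions. */
enum obj_type { TYPE_BAD = -1, TYPE_COMMIT = 1, TYPE_TREE, TYPE_BLOB, TYPE_TAG };

struct info_request {
	enum obj_type *typep;  /* optional: receives the resolved type */
	char *type_name;       /* optional: receives the type's name */
	size_t type_name_cap;  /* capacity of type_name, including the NUL */
};

/* Map a valid type to its name; return NULL for anything else. */
const char *type_name_of(enum obj_type t)
{
	switch (t) {
	case TYPE_COMMIT: return "commit";
	case TYPE_TREE:   return "tree";
	case TYPE_BLOB:   return "blob";
	case TYPE_TAG:    return "tag";
	default:          return NULL;
	}
}

/*
 * Fill in every type-related output that was requested from one
 * already-resolved type. Returns 0 on success, or -1 if a type was
 * requested and the resolved type is invalid.
 */
int fill_type_info(enum obj_type resolved, struct info_request *req)
{
	assert(req != NULL);
	if (req->typep || req->type_name) {
		if (req->typep)
			*req->typep = resolved;
		if (req->type_name && req->type_name_cap > 0) {
			const char *tn = type_name_of(resolved);
			if (tn)
				snprintf(req->type_name, req->type_name_cap,
					 "%s", tn);
		}
		if (resolved < 0)
			return -1;
	}
	return 0;
}
```

With this shape, a caller that wants only the name does not have to
borrow a temporary type pointer and reset it afterwards, which is the
kind of manipulation this patch removes.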

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*oi->typep < 0) {
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	enum object_type real_type;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
 	co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	/*
-	 * packed_object_info() does not follow the delta chain to
-	 * find out the real type, unless it is given oi->typep.
-	 */
-	if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-			oi->typep = NULL;
 		return sha1_object_info_extended(real, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 					 rtype == OBJ_OFS_DELTA);
 	}
-	if (oi->typename)
-		strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-		oi->typep = NULL;
 
 	return 0;
 }
-- 
2.13.1.508.gb3defc5cc-goog



* [RFC PATCH 2/4] sha1_file: extract type and size from object_info
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
  2017-06-09 19:23 ` [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-09 19:23 ` Jonathan Tan
  2017-06-10  7:01   ` Jeff King
  2017-06-09 19:23 ` [RFC PATCH 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
                   ` (29 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-09 19:23 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

This is patch 1 of 2 to consolidate all storage-agnostic object
information functions.

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in both the packed store and the loose store,
and being able to mark a packed object as bad and then retrying the
whole process. Consolidating these mechanisms would be a great help to
maintainability.

Such a consolidated function would need to handle the read_object() case
(which returns the object data, type, and size) and the
sha1_object_info_extended() case (which returns the object type, size,
and some additional information, not all of which can be "turned off" by
passing NULL in "struct object_info").

To make it easier to implement and use such a function, remove the type
and size fields from "struct object_info", making them additional
parameters in sha1_object_info_extended (and related functions) instead.
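
The calling convention this leads to can be sketched roughly as
follows. This is a simplified model, not the code in the diff below:
stored_object, extra_request, object_info and object_type_and_size are
hypothetical names, and the sketch also tolerates a NULL request
struct, which is an assumption of the sketch rather than something
this patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not git's real definitions. */
enum obj_type { TYPE_COMMIT = 1, TYPE_TREE, TYPE_BLOB, TYPE_TAG };

/* What the sketch knows about one stored object. */
struct stored_object {
	enum obj_type type;
	unsigned long size;      /* inflated size */
	unsigned long disk_size; /* size as stored */
};

/* Only the less common requests stay in the struct. */
struct extra_request {
	unsigned long *disk_sizep; /* optional */
};

/*
 * Type and size are plain out parameters; passing NULL means
 * "not requested". Everything else travels in the struct.
 */
int object_info(const struct stored_object *obj,
		enum obj_type *typep, unsigned long *sizep,
		struct extra_request *extra)
{
	assert(obj != NULL);
	if (typep)
		*typep = obj->type;
	if (sizep)
		*sizep = obj->size;
	if (extra && extra->disk_sizep)
		*extra->disk_sizep = obj->disk_size;
	return 0;
}

/* Convenience wrapper: returns the type, optionally reports the size. */
int object_type_and_size(const struct stored_object *obj,
			 unsigned long *sizep)
{
	enum obj_type type;

	if (object_info(obj, &type, sizep, NULL) < 0)
		return -1;
	return type;
}
```

A function that returns the object data would produce type and size
through the same two parameters, so a consolidated function does not
need to reach into the struct for them.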

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c     | 29 +++++++++++---------
 builtin/pack-objects.c |  5 ++--
 cache.h                |  6 ++---
 sha1_file.c            | 72 ++++++++++++++++++++++++++++----------------------
 streaming.c            |  4 +--
 5 files changed, 62 insertions(+), 54 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..5bb16c045 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -75,7 +75,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	switch (opt) {
 	case 't':
 		oi.typename = &sb;
-		if (sha1_object_info_extended(oid.hash, &oi, flags) < 0)
+		if (sha1_object_info_extended(oid.hash, NULL, NULL, &oi,
+					      flags) < 0)
 			die("git cat-file: could not get object info");
 		if (sb.len) {
 			printf("%s\n", sb.buf);
@@ -85,8 +86,8 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 		break;
 
 	case 's':
-		oi.sizep = &size;
-		if (sha1_object_info_extended(oid.hash, &oi, flags) < 0)
+		if (sha1_object_info_extended(oid.hash, NULL, &size, &oi,
+					      flags) < 0)
 			die("git cat-file: could not get object info");
 		printf("%lu\n", size);
 		return 0;
@@ -194,10 +195,12 @@ struct expand_data {
 	int split_on_whitespace;
 
 	/*
-	 * After a mark_query run, this object_info is set up to be
-	 * passed to sha1_object_info_extended. It will point to the data
+	 * After a mark_query run, these fields are set up to be
+	 * passed to sha1_object_info_extended. They will point to the data
 	 * elements above, so you can retrieve the response from there.
 	 */
+	enum object_type *typep;
+	unsigned long *sizep;
 	struct object_info info;
 
 	/*
@@ -224,12 +227,12 @@ static void expand_atom(struct strbuf *sb, const char *atom, int len,
 			strbuf_addstr(sb, oid_to_hex(&data->oid));
 	} else if (is_atom("objecttype", atom, len)) {
 		if (data->mark_query)
-			data->info.typep = &data->type;
+			data->typep = &data->type;
 		else
 			strbuf_addstr(sb, typename(data->type));
 	} else if (is_atom("objectsize", atom, len)) {
 		if (data->mark_query)
-			data->info.sizep = &data->size;
+			data->sizep = &data->size;
 		else
 			strbuf_addf(sb, "%lu", data->size);
 	} else if (is_atom("objectsize:disk", atom, len)) {
@@ -280,7 +283,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 {
 	const struct object_id *oid = &data->oid;
 
-	assert(data->info.typep);
+	assert(data->typep);
 
 	if (data->type == OBJ_BLOB) {
 		if (opt->buffer_output)
@@ -323,7 +326,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 			die("object %s disappeared", oid_to_hex(oid));
 		if (type != data->type)
 			die("object %s changed type!?", oid_to_hex(oid));
-		if (data->info.sizep && size != data->size)
+		if (data->sizep && size != data->size)
 			die("object %s changed size!?", oid_to_hex(oid));
 
 		batch_write(opt, contents, size);
@@ -337,7 +340,8 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 	struct strbuf buf = STRBUF_INIT;
 
 	if (!data->skip_object_info &&
-	    sha1_object_info_extended(data->oid.hash, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
+	    sha1_object_info_extended(data->oid.hash, data->typep, data->sizep,
+				      &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
 		printf("%s missing\n",
 		       obj_name ? obj_name : oid_to_hex(&data->oid));
 		fflush(stdout);
@@ -454,7 +458,8 @@ static int batch_objects(struct batch_options *opt)
 
 	if (opt->all_objects) {
 		struct object_info empty = OBJECT_INFO_INIT;
-		if (!memcmp(&data.info, &empty, sizeof(empty)))
+		if (!data.typep && !data.sizep &&
+		    !memcmp(&data.info, &empty, sizeof(empty)))
 			data.skip_object_info = 1;
 	}
 
@@ -463,7 +468,7 @@ static int batch_objects(struct batch_options *opt)
 	 * since we will want to decide whether or not to stream.
 	 */
 	if (opt->print_contents)
-		data.info.typep = &data.type;
+		data.typep = &data.type;
 
 	if (opt->all_objects) {
 		struct oid_array sa = OID_ARRAY_INIT;
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f672225de..9cecc82b2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1557,9 +1557,8 @@ static void drop_reused_delta(struct object_entry *entry)
 	entry->delta = NULL;
 	entry->depth = 0;
 
-	oi.sizep = &entry->size;
-	oi.typep = &entry->type;
-	if (packed_object_info(entry->in_pack, entry->in_pack_offset, &oi) < 0) {
+	if (packed_object_info(entry->in_pack, entry->in_pack_offset,
+			       &entry->type, &entry->size, &oi) < 0) {
 		/*
 		 * We failed to get the info from this pack for some reason;
 		 * fall back to sha1_object_info, which may find another copy.
diff --git a/cache.h b/cache.h
index 4d92aae0e..bf09962e4 100644
--- a/cache.h
+++ b/cache.h
@@ -1830,8 +1830,6 @@ extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags)
 
 struct object_info {
 	/* Request */
-	enum object_type *typep;
-	unsigned long *sizep;
 	off_t *disk_sizep;
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
@@ -1866,8 +1864,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
-extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
-extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
+extern int sha1_object_info_extended(const unsigned char *, enum object_type *typep, unsigned long *sizep, struct object_info *, unsigned flags);
+extern int packed_object_info(struct packed_git *pack, off_t offset, enum object_type *typep, unsigned long *sizep, struct object_info *);
 
 /* Dumb servers support */
 extern int update_server_info(int);
diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..ac4d77ccc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1936,8 +1936,10 @@ static void *unpack_sha1_rest(git_zstream *stream, void *buffer, unsigned long s
  * too permissive for what we want to check. So do an anal
  * object header parse by hand.
  */
-static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
-			       unsigned int flags)
+static int parse_sha1_header_extended(const char *hdr, enum object_type *typep,
+				      unsigned long *sizep,
+				      struct object_info *oi,
+				      unsigned int flags)
 {
 	const char *type_buf = hdr;
 	unsigned long size;
@@ -1968,8 +1970,8 @@ static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
 		type = 0;
 	else if (type < 0)
 		die("invalid object type");
-	if (oi->typep)
-		*oi->typep = type;
+	if (typep)
+		*typep = type;
 
 	/*
 	 * The length must follow immediately, and be in canonical
@@ -1988,8 +1990,8 @@ static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
 		}
 	}
 
-	if (oi->sizep)
-		*oi->sizep = size;
+	if (sizep)
+		*sizep = size;
 
 	/*
 	 * The length must be followed by a zero byte
@@ -2000,9 +2002,8 @@ static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
 int parse_sha1_header(const char *hdr, unsigned long *sizep)
 {
 	struct object_info oi = OBJECT_INFO_INIT;
-
-	oi.sizep = sizep;
-	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
+	return parse_sha1_header_extended(hdr, NULL, sizep, &oi,
+					  LOOKUP_REPLACE_OBJECT);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
@@ -2240,6 +2241,7 @@ static enum object_type packed_to_object_type(struct packed_git *p,
 }
 
 int packed_object_info(struct packed_git *p, off_t obj_offset,
+		       enum object_type *typep, unsigned long *sizep,
 		       struct object_info *oi)
 {
 	struct pack_window *w_curs = NULL;
@@ -2253,7 +2255,7 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 */
 	type = unpack_object_header(p, &w_curs, &curpos, &size);
 
-	if (oi->sizep) {
+	if (sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2262,13 +2264,13 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 				type = OBJ_BAD;
 				goto out;
 			}
-			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*oi->sizep == 0) {
+			*sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*sizep == 0) {
 				type = OBJ_BAD;
 				goto out;
 			}
 		} else {
-			*oi->sizep = size;
+			*sizep = size;
 		}
 	}
 
@@ -2277,12 +2279,12 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep || oi->typename) {
+	if (typep || oi->typename) {
 		enum object_type ptot;
 		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
 					     curpos);
-		if (oi->typep)
-			*oi->typep = ptot;
+		if (typep)
+			*typep = ptot;
 		if (oi->typename) {
 			const char *tn = typename(ptot);
 			if (tn)
@@ -2905,6 +2907,8 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,
 }
 
 static int sha1_loose_object_info(const unsigned char *sha1,
+				  enum object_type *typep,
+				  unsigned long *sizep,
 				  struct object_info *oi,
 				  int flags)
 {
@@ -2926,7 +2930,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * return value implicitly indicates whether the
 	 * object even exists.
 	 */
-	if (!oi->typep && !oi->typename && !oi->sizep) {
+	if (!typep && !oi->typename && !sizep) {
 		const char *path;
 		struct stat st;
 		if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2951,20 +2955,25 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	if (status < 0)
 		; /* Do nothing */
 	else if (hdrbuf.len) {
-		if ((status = parse_sha1_header_extended(hdrbuf.buf, oi, flags)) < 0)
+		if ((status = parse_sha1_header_extended(hdrbuf.buf, typep,
+							 sizep, oi, flags)) < 0)
 			status = error("unable to parse %s header with --allow-unknown-type",
 				       sha1_to_hex(sha1));
-	} else if ((status = parse_sha1_header_extended(hdr, oi, flags)) < 0)
+	} else if ((status = parse_sha1_header_extended(hdr, typep, sizep, oi,
+							flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
 	git_inflate_end(&stream);
 	munmap(map, mapsize);
-	if (status && oi->typep)
-		*oi->typep = status;
+	if (status && typep)
+		*typep = status;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
 
-int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
+int sha1_object_info_extended(const unsigned char *sha1,
+			      enum object_type *typep,
+			      unsigned long *sizep, struct object_info *oi,
+			      unsigned flags)
 {
 	struct cached_object *co;
 	struct pack_entry e;
@@ -2973,10 +2982,10 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 
 	co = find_cached_object(real);
 	if (co) {
-		if (oi->typep)
-			*(oi->typep) = co->type;
-		if (oi->sizep)
-			*(oi->sizep) = co->size;
+		if (typep)
+			*typep = co->type;
+		if (sizep)
+			*sizep = co->size;
 		if (oi->disk_sizep)
 			*(oi->disk_sizep) = 0;
 		if (oi->delta_base_sha1)
@@ -2989,7 +2998,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 
 	if (!find_pack_entry(real, &e)) {
 		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
+		if (!sha1_loose_object_info(real, typep, sizep, oi, flags)) {
 			oi->whence = OI_LOOSE;
 			return 0;
 		}
@@ -3000,10 +3009,10 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	rtype = packed_object_info(e.p, e.offset, oi);
+	rtype = packed_object_info(e.p, e.offset, typep, sizep, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		return sha1_object_info_extended(real, oi, 0);
+		return sha1_object_info_extended(real, typep, sizep, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
 	} else {
@@ -3023,9 +3032,8 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 	enum object_type type;
 	struct object_info oi = OBJECT_INFO_INIT;
 
-	oi.typep = &type;
-	oi.sizep = sizep;
-	if (sha1_object_info_extended(sha1, &oi, LOOKUP_REPLACE_OBJECT) < 0)
+	if (sha1_object_info_extended(sha1, &type, sizep, &oi,
+				      LOOKUP_REPLACE_OBJECT) < 0)
 		return -1;
 	return type;
 }
diff --git a/streaming.c b/streaming.c
index 9afa66b8b..ee5d1f684 100644
--- a/streaming.c
+++ b/streaming.c
@@ -111,9 +111,7 @@ static enum input_source istream_source(const unsigned char *sha1,
 	unsigned long size;
 	int status;
 
-	oi->typep = type;
-	oi->sizep = &size;
-	status = sha1_object_info_extended(sha1, oi, 0);
+	status = sha1_object_info_extended(sha1, type, &size, oi, 0);
 	if (status < 0)
 		return stream_error;
 
-- 
2.13.1.508.gb3defc5cc-goog



* [RFC PATCH 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
  2017-06-09 19:23 ` [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
  2017-06-09 19:23 ` [RFC PATCH 2/4] sha1_file: extract type and size from object_info Jonathan Tan
@ 2017-06-09 19:23 ` Jonathan Tan
  2017-06-09 19:23 ` [RFC PATCH 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-09 19:23 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

This is patch 2 of 2 to consolidate all storage-agnostic object
information functions.

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in both the packed store and the loose store,
and being able to mark a packed object as bad and then retrying the
whole process. Consolidating these mechanisms would be a great help to
maintainability.

Therefore, consolidate all 3 functions into 1 function.

Note that has_sha1_file_with_flags() does not try cached storage,
whereas the other 2 functions do - this functionality is preserved.
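
The overall shape can be sketched with toy stores, as below. This is a
simplified model under stated assumptions, not the code in the diff:
the stores are plain name lists, the names (repo, store, find_object,
LOOKUP_IGNORE_CACHED, object_info, has_object) are hypothetical, and
the re-scan of packs and the bad-pack retry are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, loosely modeled on "do not try cached storage". */
#define LOOKUP_IGNORE_CACHED (1u << 0)

/* A toy store: a fixed list of object names. */
struct store {
	const char *const *names;
	size_t nr;
};

struct repo {
	struct store cached, packed, loose;
};

enum whence { FROM_NONE, FROM_CACHED, FROM_PACKED, FROM_LOOSE };

static int store_has(const struct store *s, const char *name)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		if (!strcmp(s->names[i], name))
			return 1;
	return 0;
}

/* The one place that knows the order in which stores are consulted. */
enum whence find_object(const struct repo *r, const char *name,
			unsigned flags)
{
	assert(r != NULL && name != NULL);
	if (!(flags & LOOKUP_IGNORE_CACHED) && store_has(&r->cached, name))
		return FROM_CACHED;
	if (store_has(&r->packed, name))
		return FROM_PACKED;
	if (store_has(&r->loose, name))
		return FROM_LOOSE;
	return FROM_NONE;
}

/* Info-style wrapper: consults every store, returns 0 or -1. */
int object_info(const struct repo *r, const char *name, enum whence *whence)
{
	enum whence w = find_object(r, name, 0);

	if (whence)
		*whence = w;
	return w == FROM_NONE ? -1 : 0;
}

/* Existence-style wrapper: skips the cached store, returns 1 or 0. */
int has_object(const struct repo *r, const char *name)
{
	return find_object(r, name, LOOKUP_IGNORE_CACHED) != FROM_NONE;
}
```

Each public entry point becomes a thin wrapper that picks flags and
translates the return value, so a behavior difference such as skipping
cached storage is expressed as a flag rather than as a separate copy of
the lookup logic.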

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 294 ++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 165 insertions(+), 129 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index ac4d77ccc..deb08b0f1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1959,7 +1959,7 @@ static int parse_sha1_header_extended(const char *hdr, enum object_type *typep,
 	}
 
 	type = type_from_string_gently(type_buf, type_len, 1);
-	if (oi->typename)
+	if (oi && oi->typename)
 		strbuf_add(oi->typename, type_buf, type_len);
 	/*
 	 * Set type to 0 if its an unknown object and
@@ -2001,12 +2001,13 @@ static int parse_sha1_header_extended(const char *hdr, enum object_type *typep,
 
 int parse_sha1_header(const char *hdr, unsigned long *sizep)
 {
-	struct object_info oi = OBJECT_INFO_INIT;
-	return parse_sha1_header_extended(hdr, NULL, sizep, &oi,
+	return parse_sha1_header_extended(hdr, NULL, sizep, NULL,
 					  LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
+static void *unpack_sha1_file(void *map, unsigned long mapsize,
+			      enum object_type *type, unsigned long *size,
+			      const unsigned char *sha1)
 {
 	int ret;
 	git_zstream stream;
@@ -2274,18 +2275,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		}
 	}
 
-	if (oi->disk_sizep) {
+	if (oi && oi->disk_sizep) {
 		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (typep || oi->typename) {
+	if (typep || (oi && oi->typename)) {
 		enum object_type ptot;
 		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
 					     curpos);
 		if (typep)
 			*typep = ptot;
-		if (oi->typename) {
+		if (oi && oi->typename) {
 			const char *tn = typename(ptot);
 			if (tn)
 				strbuf_addstr(oi->typename, tn);
@@ -2296,7 +2297,7 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		}
 	}
 
-	if (oi->delta_base_sha1) {
+	if (oi && oi->delta_base_sha1) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			const unsigned char *base;
 
@@ -2438,8 +2439,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	if (!ent)
 		return unpack_entry(p, base_offset, type, base_size);
 
-	*type = ent->type;
-	*base_size = ent->size;
+	if (type)
+		*type = ent->type;
+	if (base_size)
+		*base_size = ent->size;
 	return xmemdupz(ent->data, ent->size);
 }
 
@@ -2907,43 +2910,20 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,
 }
 
 static int sha1_loose_object_info(const unsigned char *sha1,
+				  void *map, unsigned long mapsize,
 				  enum object_type *typep,
 				  unsigned long *sizep,
 				  struct object_info *oi,
 				  int flags)
 {
 	int status = 0;
-	unsigned long mapsize;
-	void *map;
 	git_zstream stream;
 	char hdr[32];
 	struct strbuf hdrbuf = STRBUF_INIT;
 
-	if (oi->delta_base_sha1)
+	if (oi && oi->delta_base_sha1)
 		hashclr(oi->delta_base_sha1);
-
-	/*
-	 * If we don't care about type or size, then we don't
-	 * need to look inside the object at all. Note that we
-	 * do not optimize out the stat call, even if the
-	 * caller doesn't care about the disk-size, since our
-	 * return value implicitly indicates whether the
-	 * object even exists.
-	 */
-	if (!typep && !oi->typename && !sizep) {
-		const char *path;
-		struct stat st;
-		if (stat_sha1_file(sha1, &st, &path) < 0)
-			return -1;
-		if (oi->disk_sizep)
-			*oi->disk_sizep = st.st_size;
-		return 0;
-	}
-
-	map = map_sha1_file(sha1, &mapsize);
-	if (!map)
-		return -1;
-	if (oi->disk_sizep)
+	if (oi && oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
 	if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
 		if (unpack_sha1_header_to_strbuf(&stream, map, mapsize, hdr, sizeof(hdr), &hdrbuf) < 0)
@@ -2963,29 +2943,25 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 							flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
 	git_inflate_end(&stream);
-	munmap(map, mapsize);
 	if (status && typep)
 		*typep = status;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
 
-int sha1_object_info_extended(const unsigned char *sha1,
-			      enum object_type *typep,
-			      unsigned long *sizep, struct object_info *oi,
-			      unsigned flags)
+static int get_cached_object(const unsigned char *sha1, enum object_type *typep,
+			     unsigned long *sizep, struct object_info *oi,
+			     void **buf)
 {
-	struct cached_object *co;
-	struct pack_entry e;
-	int rtype;
-	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	struct cached_object *co = find_cached_object(sha1);
+	if (!co)
+		return 0;
 
-	co = find_cached_object(real);
-	if (co) {
-		if (typep)
-			*typep = co->type;
-		if (sizep)
-			*sizep = co->size;
+	if (typep)
+		*typep = co->type;
+	if (sizep)
+		*sizep = co->size;
+	if (oi) {
 		if (oi->disk_sizep)
 			*(oi->disk_sizep) = 0;
 		if (oi->delta_base_sha1)
@@ -2993,75 +2969,160 @@ int sha1_object_info_extended(const unsigned char *sha1,
 		if (oi->typename)
 			strbuf_addstr(oi->typename, typename(co->type));
 		oi->whence = OI_CACHED;
+	}
+	if (buf)
+		*buf = xmemdupz(co->buf, co->size);
+	return 1;
+}
+
+static int get_loose_object(const unsigned char *sha1, enum object_type *typep,
+			    unsigned long *sizep, struct object_info *oi,
+			    void **buf, int tolerate_bad_type)
+{
+	const char *path;
+	struct stat st;
+
+	if (!typep && !sizep && !oi && !buf)
+		return has_loose_object(sha1);
+
+	if (buf || typep || sizep || (oi && oi->typename)) {
+		/* Need to look inside the object */
+		unsigned long mapsize;
+		int ret = 1;
+		void *map = map_sha1_file(sha1, &mapsize);
+		if (!map)
+			return 0;
+		if (buf) {
+			*buf = unpack_sha1_file(map, mapsize, typep, sizep,
+						sha1);
+			if (!*buf)
+				return 0;
+			/* avoid redundant type and size calculations */
+			typep = NULL;
+			sizep = NULL;
+		}
+		if (typep || sizep || oi) {
+			int f = tolerate_bad_type ? LOOKUP_UNKNOWN_OBJECT : 0;
+			if (sha1_loose_object_info(sha1, map, mapsize, typep,
+						   sizep, oi, f)) {
+				ret = 0;
+				goto cleanup;
+			}
+			if (oi)
+				oi->whence = OI_LOOSE;
+		}
+cleanup:
+		munmap(map, mapsize);
+		return ret;
+	}
+
+	/*
+	 * If we don't care about type or size, then we don't
+	 * need to look inside the object at all. Note that we
+	 * do not optimize out the stat call, even if the
+	 * caller doesn't care about the disk-size, since our
+	 * return value implicitly indicates whether the
+	 * object even exists.
+	 */
+	if (stat_sha1_file(sha1, &st, &path) < 0)
 		return 0;
+	if (oi) {
+		if (oi->disk_sizep)
+			*oi->disk_sizep = st.st_size;
+		oi->whence = OI_LOOSE;
 	}
+	return 1;
+}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, typep, sizep, oi, flags)) {
-			oi->whence = OI_LOOSE;
+static int get_packed_object(struct pack_entry *e, enum object_type *typep,
+			     unsigned long *sizep, struct object_info *oi,
+			     void **buf)
+{
+	int rtype;
+	if (buf) {
+		*buf = cache_or_unpack_entry(e->p, e->offset, sizep, typep);
+		if (!*buf)
 			return 0;
+		/* avoid redundant type and size calculations */
+		typep = NULL;
+		sizep = NULL;
+	}
+	if (typep || sizep || oi) {
+		rtype = packed_object_info(e->p, e->offset, typep, sizep, oi);
+		if (rtype < 0)
+			return 0;
+	}
+	if (oi) {
+		if (in_delta_base_cache(e->p, e->offset)) {
+			oi->whence = OI_DBCACHED;
+		} else {
+			oi->whence = OI_PACKED;
+			oi->u.packed.offset = e->offset;
+			oi->u.packed.pack = e->p;
+			oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
+						 rtype == OBJ_OFS_DELTA);
 		}
+	}
+	return 1;
+}
+
+/* start at 1 << 5 to leave room for LOOKUP_ flags */
+#define GET_OBJECT_QUICK (1 << 5)
+#define GET_OBJECT_IGNORE_CACHED (1 << 6)
+static int get_object(const unsigned char *sha1, enum object_type *typep,
+		      unsigned long *sizep, struct object_info *oi,
+		      void **buf, unsigned flags)
+{
+	struct pack_entry e;
+	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+
+	if (!(flags & GET_OBJECT_IGNORE_CACHED) &&
+	    get_cached_object(real, typep, sizep, oi, buf))
+		return 1;
+
+	if (!find_pack_entry(real, &e)) {
+		/* Most likely it's a loose object. */
+		if (get_loose_object(real, typep, sizep, oi, buf,
+				     flags & LOOKUP_UNKNOWN_OBJECT))
+			return 1;
 
 		/* Not a loose object; someone else may have just packed it. */
+		if (flags & GET_OBJECT_QUICK)
+			return 0;
 		reprepare_packed_git();
 		if (!find_pack_entry(real, &e))
-			return -1;
+			return 0;
 	}
 
-	rtype = packed_object_info(e.p, e.offset, typep, sizep, oi);
-	if (rtype < 0) {
-		mark_bad_packed_object(e.p, real);
-		return sha1_object_info_extended(real, typep, sizep, oi, 0);
-	} else if (in_delta_base_cache(e.p, e.offset)) {
-		oi->whence = OI_DBCACHED;
-	} else {
-		oi->whence = OI_PACKED;
-		oi->u.packed.offset = e.offset;
-		oi->u.packed.pack = e.p;
-		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
-					 rtype == OBJ_OFS_DELTA);
-	}
+	if (get_packed_object(&e, typep, sizep, oi, buf))
+		return 1;
 
-	return 0;
+	/*
+	 * Try to fetch the required object anyway from another pack or loose.
+	 * This should happen only in the presence of a corrupted
+	 * pack, and is better than failing outright.
+	 */
+	mark_bad_packed_object(e.p, real);
+	return get_object(real, typep, sizep, oi, buf, flags);
+}
+
+int sha1_object_info_extended(const unsigned char *sha1,
+			      enum object_type *typep, unsigned long *sizep,
+			      struct object_info *oi, unsigned flags)
+{
+	return get_object(sha1, typep, sizep, oi, NULL, flags) ? 0 : -1;
 }
 
 /* returns enum object_type or negative */
 int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 {
 	enum object_type type;
-	struct object_info oi = OBJECT_INFO_INIT;
-
-	if (sha1_object_info_extended(sha1, &type, sizep, &oi,
+	if (sha1_object_info_extended(sha1, &type, sizep, NULL,
 				      LOOKUP_REPLACE_OBJECT) < 0)
 		return -1;
 	return type;
 }
 
-static void *read_packed_sha1(const unsigned char *sha1,
-			      enum object_type *type, unsigned long *size)
-{
-	struct pack_entry e;
-	void *data;
-
-	if (!find_pack_entry(sha1, &e))
-		return NULL;
-	data = cache_or_unpack_entry(e.p, e.offset, size, type);
-	if (!data) {
-		/*
-		 * We're probably in deep shit, but let's try to fetch
-		 * the required object anyway from another pack or loose.
-		 * This should happen only in the presence of a corrupted
-		 * pack, and is better than failing outright.
-		 */
-		error("failed to read object %s at offset %"PRIuMAX" from %s",
-		      sha1_to_hex(sha1), (uintmax_t)e.offset, e.p->pack_name);
-		mark_bad_packed_object(e.p, sha1);
-		data = read_object(sha1, type, size);
-	}
-	return data;
-}
-
 int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 		      unsigned char *sha1)
 {
@@ -3083,28 +3144,10 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size)
 {
-	unsigned long mapsize;
-	void *map, *buf;
-	struct cached_object *co;
-
-	co = find_cached_object(sha1);
-	if (co) {
-		*type = co->type;
-		*size = co->size;
-		return xmemdupz(co->buf, co->size);
-	}
-
-	buf = read_packed_sha1(sha1, type, size);
-	if (buf)
-		return buf;
-	map = map_sha1_file(sha1, &mapsize);
-	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
-		munmap(map, mapsize);
+	void *buf;
+	if (get_object(sha1, type, size, NULL, &buf, 0))
 		return buf;
-	}
-	reprepare_packed_git();
-	return read_packed_sha1(sha1, type, size);
+	return NULL;
 }
 
 /*
@@ -3456,7 +3499,7 @@ int force_object_loose(const unsigned char *sha1, time_t mtime)
 
 	if (has_loose_object(sha1))
 		return 0;
-	buf = read_packed_sha1(sha1, &type, &len);
+	buf = read_object(sha1, &type, &len);
 	if (!buf)
 		return error("cannot read sha1_file for %s", sha1_to_hex(sha1));
 	hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %lu", typename(type), len) + 1;
@@ -3482,18 +3525,11 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-	struct pack_entry e;
-
+	int f = GET_OBJECT_IGNORE_CACHED |
+		(flags & HAS_SHA1_QUICK ? GET_OBJECT_QUICK : 0);
 	if (!startup_info->have_repository)
 		return 0;
-	if (find_pack_entry(sha1, &e))
-		return 1;
-	if (has_loose_object(sha1))
-		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	return get_object(sha1, NULL, NULL, NULL, NULL, f);
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.508.gb3defc5cc-goog



* [RFC PATCH 4/4] sha1_file, fsck: add missing blob support
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (2 preceding siblings ...)
  2017-06-09 19:23 ` [RFC PATCH 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
@ 2017-06-09 19:23 ` Jonathan Tan
  2017-06-13 21:05 ` [PATCH v2 0/4] Improvements to sha1_file Jonathan Tan
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-09 19:23 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Currently, Git does not work well with repos that contain very large
numbers of blobs, or with repos that wish to minimize manipulation of
certain blobs (for example, because they are very large), even if the
user operates mostly on part of the repo. This is because Git is
designed on the assumption that every blob referenced by a tree object
is available somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.
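
A minimal sketch of how a blob might be looked up in a missing-blob
manifest. The layout is only inferred from has_missing_blob() and from
the printf in the test script of this patch (an 8-byte big-endian entry
count, then entries sorted by hash, each a 20-byte SHA-1 followed by an
8-byte big-endian size); it is not a documented format. The names
manifest_lookup() and read_be64() are hypothetical and not part of Git:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_RAWSZ 20              /* raw SHA-1 length */
#define ENTRY_SZ (HASH_RAWSZ + 8)  /* hash followed by a 64-bit size */

/* Read a 64-bit big-endian (network byte order) integer. */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Hypothetical helper: binary-search one manifest for "hash".
 * Returns 1 (and stores the recorded size in *size, if non-NULL)
 * when the hash is listed, 0 otherwise. Assumes sorted entries.
 */
static int manifest_lookup(const unsigned char *data, size_t len,
			   const unsigned char *hash, uint64_t *size)
{
	uint64_t lo = 0, hi;

	assert(data && hash);
	if (len < 8)
		return 0;
	hi = read_be64(data);
	if (hi > (len - 8) / ENTRY_SZ)
		return 0; /* truncated or corrupt manifest */

	while (lo < hi) {
		uint64_t mid = lo + (hi - lo) / 2;
		const unsigned char *entry = data + 8 + mid * ENTRY_SZ;
		int cmp = memcmp(hash, entry, HASH_RAWSZ);

		if (!cmp) {
			if (size)
				*size = read_be64(entry + HASH_RAWSZ);
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

Unlike the patch, this sketch rejects a manifest whose entry count
exceeds the mapped length; whether the real code should do the same is
left open here.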

This commit does not include support for generating such a repo, nor
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the caller
     needing to know how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to get_object().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Invoking the hook only when a packed blob is requested, instead of
   for any blob. That is, whenever we search the packfiles for a blob
   and it is missing (from the packfiles and from the loose object
   storage), we would invoke the hook (which must then store the blob
   in a packfile), open the packfile the hook generated, and report
   that the blob is found in that new packfile. This reduces the amount
   of analysis needed (in that we only need to look at how packed blobs
   are handled), but requires that the hook generate packfiles (or that
   sha1_file pack whatever loose objects are generated), creating one
   packfile for each missing blob and potentially very many packfiles
   that must be linearly searched. This may be tolerable now for repos
   that only have a few missing blobs (for example, repos that only
   want to exclude large blobs), and might be tolerable in the future
   if we have batching support for the most commonly used commands, but
   is not tolerable now for repos that exclude a large number of blobs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt |  10 ++++
 builtin/fsck.c           |   7 +++
 cache.h                  |   6 ++
 sha1_file.c              | 147 +++++++++++++++++++++++++++++++++++++++++++----
 t/t3907-missing-blob.sh  |  69 ++++++++++++++++++++++
 5 files changed, 229 insertions(+), 10 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+	If set, whenever an attempt is made to read a blob that is
+	missing from the local repo, invoke this shell command to generate
+	or obtain that blob before reporting an error. This shell command
+	should take one or more hashes, each terminated by a newline, as
+	standard input, and (if successful) should write the
+	corresponding objects to the local repo (packed or loose).
++
+If set, fsck will not treat a missing blob as an error condition.
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/builtin/fsck.c b/builtin/fsck.c
index cb2ba6cd1..6f10d6034 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -37,6 +37,7 @@ static int verbose;
 static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
+static int missing_blob_ok;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -93,6 +94,9 @@ static int fsck_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.missingblobcommand"))
+		missing_blob_ok = 1;
+
 	return git_default_config(var, value, cb);
 }
 
@@ -222,6 +226,9 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
+		if (missing_blob_ok && obj->type == OBJ_BLOB &&
+		    has_missing_blob(obj->oid.hash, NULL))
+			return;
 		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
diff --git a/cache.h b/cache.h
index bf09962e4..b9221b7e2 100644
--- a/cache.h
+++ b/cache.h
@@ -1867,6 +1867,12 @@ struct object_info {
 extern int sha1_object_info_extended(const unsigned char *, enum object_type *typep, unsigned long *sizep, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, enum object_type *typep, unsigned long *sizep, struct object_info *);
 
+/*
+ * Returns 1 if sha1 is the hash of a known missing blob. If size is not NULL,
+ * also returns its size.
+ */
+extern int has_missing_blob(const unsigned char *sha1, unsigned long *size);
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/sha1_file.c b/sha1_file.c
index deb08b0f1..87dc0a393 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -27,6 +27,9 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -1624,6 +1627,72 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 	return NULL;
 }
 
+struct missing_blob_manifest {
+	struct missing_blob_manifest *next;
+	const char *data;
+};
+struct missing_blob_manifest *missing_blobs;
+int missing_blobs_initialized;
+
+static void prepare_missing_blobs(void)
+{
+	int ok;
+	char *dirname;
+	struct dir_iterator *iter;
+
+	if (missing_blobs_initialized)
+		return;
+
+	missing_blobs_initialized = 1;
+
+	dirname = xstrfmt("%s/missing", get_object_directory());
+	iter = dir_iterator_begin(dirname);
+
+	while ((ok = dir_iterator_advance(iter)) == ITER_OK) {
+		int fd;
+		const char *data;
+		struct missing_blob_manifest *m;
+		if (!S_ISREG(iter->st.st_mode))
+			continue;
+		fd = git_open(iter->path.buf);
+		data = xmmap(NULL, iter->st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		close(fd);
+
+		m = xmalloc(sizeof(*m));
+		m->next = missing_blobs;
+		m->data = data;
+		missing_blobs = m;
+	}
+
+	if (ok != ITER_DONE) {
+		/* do something */
+	}
+
+	free(dirname);
+}
+
+int has_missing_blob(const unsigned char *sha1, unsigned long *size)
+{
+	struct missing_blob_manifest *m;
+	prepare_missing_blobs();
+	for (m = missing_blobs; m; m = m->next) {
+		uint64_t nr_nbo, nr;
+		int result;
+		memcpy(&nr_nbo, m->data, sizeof(nr_nbo));
+		nr = ntohll(nr_nbo);
+		result = sha1_entry_pos(m->data, GIT_SHA1_RAWSZ + 8, 8, 0, nr, nr, sha1);
+		if (result >= 0) {
+			if (size) {
+				uint64_t size_nbo;
+				memcpy(&size_nbo, m->data + 8 + result * (GIT_SHA1_RAWSZ + 8) + GIT_SHA1_RAWSZ, sizeof(size_nbo));
+				*size = ntohll(size_nbo);
+			}
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /*
  * With an in-core object data in "map", rehash it to make sure the
  * object name actually matches "sha1" to detect object corruption.
@@ -2949,6 +3018,49 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	return (status < 0) ? status : 0;
 }
 
+static char *missing_blob_command;
+static int sha1_file_config(const char *conf_key, const char *value, void *cb)
+{
+	if (!strcmp(conf_key, "core.missingblobcommand")) {
+		missing_blob_command = xstrdup(value);
+	}
+	return 0;
+}
+
+static int configured;
+static void ensure_configured(void)
+{
+	if (configured)
+		return;
+
+	git_config(sha1_file_config, NULL);
+	configured = 1;
+}
+
+static void handle_missing_blob(const unsigned char *sha1)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	const char *argv[] = {missing_blob_command, NULL};
+	char input[GIT_MAX_HEXSZ + 1];
+
+	memcpy(input, sha1_to_hex(sha1), 40);
+	input[40] = '\n';
+
+	cp.argv = argv;
+	cp.env = local_repo_env;
+	cp.use_shell = 1;
+
+	if (pipe_command(&cp, input, sizeof(input), NULL, 0, NULL, 0)) {
+		die("failed to load blob %s", sha1_to_hex(sha1));
+	}
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+}
+
 static int get_cached_object(const unsigned char *sha1, enum object_type *typep,
 			     unsigned long *sizep, struct object_info *oi,
 			     void **buf)
@@ -3075,25 +3187,40 @@ static int get_object(const unsigned char *sha1, enum object_type *typep,
 {
 	struct pack_entry e;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	int already_retried = 0;
 
 	if (!(flags & GET_OBJECT_IGNORE_CACHED) &&
 	    get_cached_object(real, typep, sizep, oi, buf))
 		return 1;
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (get_loose_object(real, typep, sizep, oi, buf,
-				     flags & LOOKUP_UNKNOWN_OBJECT))
-			return 1;
+	/* Most likely it's a loose object. */
+	if (get_loose_object(real, typep, sizep, oi, buf,
+			     flags & LOOKUP_UNKNOWN_OBJECT))
+		return 1;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & GET_OBJECT_QUICK)
-			return 0;
+	/* Not a loose object; someone else may have just packed it. */
+	if (!(flags & GET_OBJECT_QUICK)) {
 		reprepare_packed_git();
-		if (!find_pack_entry(real, &e))
-			return 0;
+		if (find_pack_entry(real, &e))
+			goto found_packed;
+	}
+
+	/* Try the missing blobs */
+	if (!already_retried) {
+		ensure_configured();
+		if (missing_blob_command && has_missing_blob(sha1, NULL)) {
+			already_retried = 1;
+			handle_missing_blob(sha1);
+			goto retry;
+		}
 	}
 
+	return 0;
+
+found_packed:
 	if (get_packed_object(&e, typep, sizep, oi, buf))
 		return 1;
 
diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
new file mode 100755
index 000000000..e0ce0942d
--- /dev/null
+++ b/t/t3907-missing-blob.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='core.missingblobcommand option'
+
+. ./test-lib.sh
+
+pack() {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
+
+test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+
+	# does not work if missing blob is not registered
+	test_must_fail git -C client cat-file -p "$HASH" &&
+
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+
+	# works when missing blob is registered
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'has_sha1_file (through git cat-file -e)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	git -C client cat-file -e "$HASH"
+'
+
+test_expect_success 'fsck' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+	echo hash is $HASH &&
+
+	cp -r server client &&
+	test_config -C client core.missingblobcommand "this-command-is-not-actually-run" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	rm client/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&
+	git -C client fsck
+'
+
+test_done
-- 
2.13.1.508.gb3defc5cc-goog



* Re: [RFC PATCH 2/4] sha1_file: extract type and size from object_info
  2017-06-09 19:23 ` [RFC PATCH 2/4] sha1_file: extract type and size from object_info Jonathan Tan
@ 2017-06-10  7:01   ` Jeff King
  2017-06-12 19:52     ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff King @ 2017-06-10  7:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Fri, Jun 09, 2017 at 12:23:24PM -0700, Jonathan Tan wrote:

> Looking at the 3 primary functions (sha1_object_info_extended,
> read_object, has_sha1_file_with_flags), they independently implement
> mechanisms such as object replacement, retrying the packed store after
> failing to find the object in the packed store then the loose store, and
> being able to mark a packed object as bad and then retrying the whole
> process. Consolidating these mechanisms would be a great help to
> maintainability.
> 
> Such a consolidated function would need to handle the read_object() case
> (which returns the object data, type, and size) and the
> sha1_object_info_extended() case (which returns the object type, size,
> and some additional information, not all of which can be "turned off" by
> passing NULL in "struct object_info").

I like the idea of consolidating the logic. But I can't help but feel
that pulling these fields out of object_info is a step backwards. The
point of that struct is to let the caller specify which aspects of the
object they're interested in, and let the lookup function decide how
best to optimize the query.

So it seems like places which actually want to read the object should be
passing in a new field in the object_info for "yes, I actually want the
object contents, too", and then the consolidated function can decide
which approach to take based on whether or not the contents are
requested (e.g., unpacking the whole thing, or just the header).

If a caller asks for the contents but not the size, that's OK. We'd find
the size incidentally while unpacking the contents, but just not include
it in the returned object_info.
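
A minimal sketch of that idea, using a toy in-memory object rather than
Git's real storage. The struct toy_object_info, its contentp field, and
toy_object_info_extended() are hypothetical names for illustration
only, not existing Git API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum toy_type { TOY_BLOB = 3 };

/* Each non-NULL pointer is a request for that aspect of the object. */
struct toy_object_info {
	enum toy_type *typep;
	unsigned long *sizep;
	void **contentp; /* hypothetical: "I want the contents, too" */
};

/* Stand-in for an object somewhere in storage. */
struct toy_object {
	enum toy_type type;
	unsigned long size;
	const char *data;
};

/* Returns 0 on success, -1 on failure; fills only what was requested. */
static int toy_object_info_extended(const struct toy_object *obj,
				    struct toy_object_info *oi)
{
	assert(obj && oi);
	if (oi->contentp) {
		/* Full read; the size is learned incidentally. */
		char *buf = malloc(obj->size + 1);

		if (!buf)
			return -1;
		memcpy(buf, obj->data, obj->size);
		buf[obj->size] = '\0';
		*oi->contentp = buf;
	}
	/* Header-only information is reported only if asked for. */
	if (oi->typep)
		*oi->typep = obj->type;
	if (oi->sizep)
		*oi->sizep = obj->size;
	return 0;
}
```

A caller that sets only contentp gets the buffer back and nothing else
is written; a caller that sets only sizep never triggers the copy.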

Another approach to this whole mess is to have a single function for
acquiring a "handle" to an object, along with functions to query aspects
of a handle. That would let callers iteratively ask for the parts they
care about, and we could lazily fill the handle info (so information we
pick up while servicing one property of the object gets cached and
returned for free if the caller asks for it later).

That's a much bigger change, though it may have other benefits (e.g., we
could be passing around handles instead of object buffers, which would
make it more natural to stream object content in many cases).
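
A rough sketch of what such a lazily filled handle could look like,
again with toy types. Everything here (struct toy_handle,
toy_handle_open(), toy_handle_type(), toy_handle_size()) is
hypothetical and only illustrates the caching behavior, not how Git
would actually keep the underlying object accessible:

```c
#include <assert.h>
#include <stddef.h>

/* Toy backing store entry; stands in for a packed or loose object. */
struct toy_stored {
	int type;
	unsigned long size;
};

/* Hypothetical handle: caches properties the first time they are read. */
struct toy_handle {
	const struct toy_stored *src;
	int have_header;
	int type;
	unsigned long size;
	int header_reads; /* how often the "expensive" header parse ran */
};

static void toy_handle_open(struct toy_handle *h, const struct toy_stored *src)
{
	assert(h && src);
	h->src = src;
	h->have_header = 0;
	h->type = 0;
	h->size = 0;
	h->header_reads = 0;
}

/* Parse the header at most once; it yields both type and size. */
static void toy_handle_fill_header(struct toy_handle *h)
{
	if (h->have_header)
		return;
	h->type = h->src->type;
	h->size = h->src->size;
	h->have_header = 1;
	h->header_reads++;
}

static int toy_handle_type(struct toy_handle *h)
{
	toy_handle_fill_header(h);
	return h->type;
}

static unsigned long toy_handle_size(struct toy_handle *h)
{
	toy_handle_fill_header(h);
	return h->size;
}
```

Asking for the type and then the size parses the header only once; the
second query is answered from the handle.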

-Peff


* Re: [RFC PATCH 2/4] sha1_file: extract type and size from object_info
  2017-06-10  7:01   ` Jeff King
@ 2017-06-12 19:52     ` Jonathan Tan
  2017-06-12 21:13       ` Jeff King
  0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-12 19:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Sat, 10 Jun 2017 03:01:33 -0400
Jeff King <peff@peff.net> wrote:

> On Fri, Jun 09, 2017 at 12:23:24PM -0700, Jonathan Tan wrote:
> 
> > Looking at the 3 primary functions (sha1_object_info_extended,
> > read_object, has_sha1_file_with_flags), they independently implement
> > mechanisms such as object replacement, retrying the packed store after
> > failing to find the object in the packed store then the loose store, and
> > being able to mark a packed object as bad and then retrying the whole
> > process. Consolidating these mechanisms would be a great help to
> > maintainability.
> > 
> > Such a consolidated function would need to handle the read_object() case
> > (which returns the object data, type, and size) and the
> > sha1_object_info_extended() case (which returns the object type, size,
> > and some additional information, not all of which can be "turned off" by
> > passing NULL in "struct object_info").
> 
> I like the idea of consolidating the logic. But I can't help but feel
> that pulling these fields out of object_info is a step backwards. The
> point of that struct is to let the caller specify which aspects of the
> object they're interested in

My issue was that there are some parts that cannot be turned off (in
particular, the object_info.u.packed part). Having said that, reading
the packed object itself should give us enough information to populate
that, so I'll take a look and see if this is possible.

> Another approach to this whole mess is to have a single function for
> acquiring a "handle" to an object, along with functions to query aspects
> of a handle. That would let callers iteratively ask for the parts they
> care about, and we could lazily fill the handle info (so information we
> pick up while servicing one property of the object gets cached and
> returned for free if the caller asks for it later).
> 
> That's a much bigger change, though it may have other benefits (e.g., we
> could be passing around handles instead of object buffers, which would
> make it more natural to stream object content in many cases).

There are a few safeguards that, I think, only work with the current
get-everything-then-forget-about-it approach (the packed-loose-packed
retry mechanism, and the desperate retry-if-corrupt-packed-object one).
If we have a handle with a cache, then, for example, we would lose the
ability to retry packed after loose if the handle has already declared
that the object is loose.
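
For illustration, the packed-loose-packed retry can be sketched with
toy stand-ins for the object stores. The lookup order mirrors the old
has_sha1_file_with_flags() shown in patch 3; struct toy_repo and the
toy_* functions are hypothetical and do not exist in Git:

```c
#include <assert.h>

/* Toy stand-ins for the packed store, the loose store and a rescan. */
struct toy_repo {
	int in_known_packs; /* visible through the current pack list */
	int in_loose;
	int in_new_pack;    /* packed by someone else after our last scan */
	int rescans;
};

static int toy_find_packed(const struct toy_repo *r)
{
	return r->in_known_packs;
}

static int toy_find_loose(const struct toy_repo *r)
{
	return r->in_loose;
}

/* Models reprepare_packed_git(): newly written packs become visible. */
static void toy_reprepare_packs(struct toy_repo *r)
{
	r->rescans++;
	if (r->in_new_pack) {
		r->in_known_packs = 1;
		r->in_new_pack = 0;
	}
}

/* Returns 1 if the object is found; "quick" skips the rescan. */
static int toy_has_object(struct toy_repo *r, int quick)
{
	assert(r);
	if (toy_find_packed(r))
		return 1;
	if (toy_find_loose(r))
		return 1;
	/* Not loose either; someone else may have just packed it. */
	if (quick)
		return 0;
	toy_reprepare_packs(r);
	return toy_find_packed(r);
}
```

A handle that had already recorded "this object is loose" would have
no natural place for the final rescan step.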


* Re: [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 ` [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-12 20:55   ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-12 20:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> In commit 46f0344 ("sha1_file: support reading from a loose object of
> unknown type", 2015-05-06), "struct object_info" gained a "typename"
> field that could represent a type name from a loose object file, whether
> valid or invalid, as opposed to the existing "typep" which could only
> represent valid types. Some relatively complex manipulations were added
> to avoid breaking packed_object_info() without modifying it, but it is
> much easier to just teach packed_object_info() about the new field.
> Therefore, teach packed_object_info() as described above.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  sha1_file.c | 29 ++++++++++++-----------------
>  1 file changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/sha1_file.c b/sha1_file.c
> index 59a4ed2ed..a52b27541 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
>  		*oi->disk_sizep = revidx[1].offset - obj_offset;
>  	}
>  
> -	if (oi->typep) {
> -		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
> -		if (*oi->typep < 0) {
> +	if (oi->typep || oi->typename) {
> +		enum object_type ptot;
> +		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
> +					     curpos);
> +		if (oi->typep)
> +			*oi->typep = ptot;
> +		if (oi->typename) {
> +			const char *tn = typename(ptot);
> +			if (tn)
> +				strbuf_addstr(oi->typename, tn);
> +		}
> +		if (ptot < 0) {
>  			type = OBJ_BAD;
>  			goto out;
>  		}

OK.  When the caller wants to learn typename, we need to do this
type-to-string conversion somewhere anyway, and I agree that it is
better to do it here, instead of in the caller.

> @@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>  	struct cached_object *co;
>  	struct pack_entry e;
>  	int rtype;
> -	enum object_type real_type;
>  	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
>  
>  	co = find_cached_object(real);
> @@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>  			return -1;
>  	}
>  
> -	/*
> -	 * packed_object_info() does not follow the delta chain to
> -	 * find out the real type, unless it is given oi->typep.
> -	 */
> -	if (oi->typename && !oi->typep)
> -		oi->typep = &real_type;
> -
>  	rtype = packed_object_info(e.p, e.offset, oi);
>  	if (rtype < 0) {
>  		mark_bad_packed_object(e.p, real);
> -		if (oi->typep == &real_type)
> -			oi->typep = NULL;
>  		return sha1_object_info_extended(real, oi, 0);
>  	} else if (in_delta_base_cache(e.p, e.offset)) {
>  		oi->whence = OI_DBCACHED;
> @@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>  		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
>  					 rtype == OBJ_OFS_DELTA);
>  	}
> -	if (oi->typename)
> -		strbuf_addstr(oi->typename, typename(*oi->typep));
> -	if (oi->typep == &real_type)
> -		oi->typep = NULL;
>  
>  	return 0;
>  }


* Re: [RFC PATCH 2/4] sha1_file: extract type and size from object_info
  2017-06-12 19:52     ` Jonathan Tan
@ 2017-06-12 21:13       ` Jeff King
  0 siblings, 0 replies; 70+ messages in thread
From: Jeff King @ 2017-06-12 21:13 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Mon, Jun 12, 2017 at 12:52:54PM -0700, Jonathan Tan wrote:

> > I like the idea of consolidating the logic. But I can't help but feel
> > that pulling these fields out of object_info is a step backwards. The
> > point of that struct is to let the caller specify which aspects of the
> > object they're interested in
> 
> My issue was that there are some parts that cannot be turned off (in
> particular, the object_info.u.packed part). Having said that, reading
> the packed object itself should give us enough information to populate
> that, so I'll take a look and see if this is possible.

I think in general that the parts of object_info which aren't optional
should be largely "free" to set (or at least O(1)).

> > Another approach to this whole mess is to have a single function for
> > acquiring a "handle" to an object, along with functions to query aspects
> [...]
> 
> There are a few safeguards that, I think, only work with the current
> get-everything-then-forget-about-it approach (the packed-loose-packed
> retry mechanism, and the desperate retry-if-corrupt-packed-object one).
> If we have a handle with a cache, then, for example, we would lose the
> ability to retry packed after loose if the handle has already declared
> that the object is loose.

Yes, the handle would have to make some guarantee that it could access
the object. Which would generally involve holding open a descriptor or
mmap. That would probably take some surgery to make it work with the way
pack mmap windows work.

So the whole "handle" thing is how it probably _ought_ to work, but I
agree we may be too far down the other path to make it worth switching.

-Peff


* [PATCH v2 0/4] Improvements to sha1_file
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-06-09 19:23 ` [RFC PATCH 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
@ 2017-06-13 21:05 ` Jonathan Tan
  2017-06-13 21:05 ` [PATCH v2 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-13 21:05 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

Peff suggested putting in a new field in struct object_info for the
object contents; I tried it and it seems to work out quite well.

Patch 1 is unmodified from the previous version. Patches 2-3 have been
rewritten, and patch 4 is similar except that the missing blob lookup
is now added to sha1_object_info_extended() instead of the now-gone
get_object().

As before, I would like review on patches 1-3 to go into the tree.
(Patch 4 is a work in progress, and is here just to demonstrate the
effectiveness of the refactoring.)

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: move delta base cache code up
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/fsck.c           |   7 +
 cache.h                  |  13 ++
 sha1_file.c              | 506 +++++++++++++++++++++++++++++------------------
 t/t3907-missing-blob.sh  |  69 +++++++
 5 files changed, 418 insertions(+), 187 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.518.g3df882009-goog



* [PATCH v2 1/4] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (4 preceding siblings ...)
  2017-06-13 21:05 ` [PATCH v2 0/4] Improvements to sha1_file Jonathan Tan
@ 2017-06-13 21:05 ` Jonathan Tan
  2017-06-13 21:05 ` [PATCH v2 2/4] sha1_file: move delta base cache code up Jonathan Tan
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-13 21:05 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.
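
For readers who want to see the shape of this change outside of Git,
here is a minimal, self-contained sketch. It is only an illustration:
every toy_* name is a hypothetical stand-in (not Git's types), and a
fixed-size buffer replaces the strbuf. The point it tries to show is
that the real type is resolved once and then copied into whichever of
the two optional outputs the caller asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are not Git's definitions. */
enum toy_type { TOY_BAD = -1, TOY_COMMIT = 1, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_info {
	enum toy_type *typep;  /* optional: receives the numeric type */
	char *type_name;       /* optional: receives the type name */
	size_t type_name_len;  /* capacity of type_name, including NUL */
};

static const char *toy_type_name(enum toy_type t)
{
	switch (t) {
	case TOY_COMMIT: return "commit";
	case TOY_TREE:   return "tree";
	case TOY_BLOB:   return "blob";
	case TOY_TAG:    return "tag";
	default:         return NULL; /* invalid types have no name */
	}
}

/*
 * Given an already resolved type, fill in whichever outputs were
 * requested. Returns 0 on success and -1 if the type is invalid.
 */
static int toy_fill_type(enum toy_type resolved, struct toy_info *oi)
{
	assert(oi);
	if (oi->typep || oi->type_name) {
		if (oi->typep)
			*oi->typep = resolved;
		if (oi->type_name && oi->type_name_len) {
			const char *tn = toy_type_name(resolved);
			if (tn)
				snprintf(oi->type_name, oi->type_name_len,
					 "%s", tn);
		}
		if (resolved < 0)
			return -1;
	}
	return 0;
}
```

A caller that only wants the name leaves typep NULL; no scratch
variable has to be swapped in and out of the request, which is the
manipulation this patch removes from sha1_object_info_extended().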

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*oi->typep < 0) {
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	enum object_type real_type;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
 	co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	/*
-	 * packed_object_info() does not follow the delta chain to
-	 * find out the real type, unless it is given oi->typep.
-	 */
-	if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-			oi->typep = NULL;
 		return sha1_object_info_extended(real, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 					 rtype == OBJ_OFS_DELTA);
 	}
-	if (oi->typename)
-		strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-		oi->typep = NULL;
 
 	return 0;
 }
-- 
2.13.1.518.g3df882009-goog



* [PATCH v2 2/4] sha1_file: move delta base cache code up
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (5 preceding siblings ...)
  2017-06-13 21:05 ` [PATCH v2 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-13 21:05 ` Jonathan Tan
  2017-06-15 17:00   ` Junio C Hamano
  2017-06-13 21:05 ` [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
                   ` (24 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-13 21:05 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 226 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 116 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..a158907d1 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct packed_git *p,
 	goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-		       struct object_info *oi)
-{
-	struct pack_window *w_curs = NULL;
-	unsigned long size;
-	off_t curpos = obj_offset;
-	enum object_type type;
-
-	/*
-	 * We always get the representation type, but only convert it to
-	 * a "real" type later if the caller is interested.
-	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-	if (oi->sizep) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			off_t tmp_pos = curpos;
-			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-							   type, obj_offset);
-			if (!base_offset) {
-				type = OBJ_BAD;
-				goto out;
-			}
-			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*oi->sizep == 0) {
-				type = OBJ_BAD;
-				goto out;
-			}
-		} else {
-			*oi->sizep = size;
-		}
-	}
-
-	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
-	}
-
-	if (oi->typep || oi->typename) {
-		enum object_type ptot;
-		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-					     curpos);
-		if (oi->typep)
-			*oi->typep = ptot;
-		if (oi->typename) {
-			const char *tn = typename(ptot);
-			if (tn)
-				strbuf_addstr(oi->typename, tn);
-		}
-		if (ptot < 0) {
-			type = OBJ_BAD;
-			goto out;
-		}
-	}
-
-	if (oi->delta_base_sha1) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			const unsigned char *base;
-
-			base = get_delta_base_sha1(p, &w_curs, curpos,
-						   type, obj_offset);
-			if (!base) {
-				type = OBJ_BAD;
-				goto out;
-			}
-
-			hashcpy(oi->delta_base_sha1, base);
-		} else
-			hashclr(oi->delta_base_sha1);
-	}
-
-out:
-	unuse_pack(&w_curs);
-	return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
-{
-	int st;
-	git_zstream stream;
-	unsigned char *buffer, *in;
-
-	buffer = xmallocz_gently(size);
-	if (!buffer)
-		return NULL;
-	memset(&stream, 0, sizeof(stream));
-	stream.next_out = buffer;
-	stream.avail_out = size + 1;
-
-	git_inflate_init(&stream);
-	do {
-		in = use_pack(p, w_curs, curpos, &stream.avail_in);
-		stream.next_in = in;
-		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
-			break; /* the payload is larger than it should be */
-		curpos += stream.next_in - in;
-	} while (st == Z_OK || st == Z_BUF_ERROR);
-	git_inflate_end(&stream);
-	if ((st != Z_STREAM_END) || stream.total_out != size) {
-		free(buffer);
-		return NULL;
-	}
-
-	return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,122 @@ static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
 	hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+		       struct object_info *oi)
+{
+	struct pack_window *w_curs = NULL;
+	unsigned long size;
+	off_t curpos = obj_offset;
+	enum object_type type;
+
+	/*
+	 * We always get the representation type, but only convert it to
+	 * a "real" type later if the caller is interested.
+	 */
+	type = unpack_object_header(p, &w_curs, &curpos, &size);
+
+	if (oi->sizep) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			off_t tmp_pos = curpos;
+			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
+							   type, obj_offset);
+			if (!base_offset) {
+				type = OBJ_BAD;
+				goto out;
+			}
+			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*oi->sizep == 0) {
+				type = OBJ_BAD;
+				goto out;
+			}
+		} else {
+			*oi->sizep = size;
+		}
+	}
+
+	if (oi->disk_sizep) {
+		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
+		*oi->disk_sizep = revidx[1].offset - obj_offset;
+	}
+
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+	}
+
+	if (oi->delta_base_sha1) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			const unsigned char *base;
+
+			base = get_delta_base_sha1(p, &w_curs, curpos,
+						   type, obj_offset);
+			if (!base) {
+				type = OBJ_BAD;
+				goto out;
+			}
+
+			hashcpy(oi->delta_base_sha1, base);
+		} else
+			hashclr(oi->delta_base_sha1);
+	}
+
+	oi->whence = OI_PACKED;
+	oi->u.packed.offset = obj_offset;
+	oi->u.packed.pack = p;
+	oi->u.packed.is_delta = (type == OBJ_REF_DELTA ||
+				 type == OBJ_OFS_DELTA);
+
+out:
+	unuse_pack(&w_curs);
+	return type;
+}
+
+static void *unpack_compressed_entry(struct packed_git *p,
+				    struct pack_window **w_curs,
+				    off_t curpos,
+				    unsigned long size)
+{
+	int st;
+	git_zstream stream;
+	unsigned char *buffer, *in;
+
+	buffer = xmallocz_gently(size);
+	if (!buffer)
+		return NULL;
+	memset(&stream, 0, sizeof(stream));
+	stream.next_out = buffer;
+	stream.avail_out = size + 1;
+
+	git_inflate_init(&stream);
+	do {
+		in = use_pack(p, w_curs, curpos, &stream.avail_in);
+		stream.next_in = in;
+		st = git_inflate(&stream, Z_FINISH);
+		if (!stream.avail_out)
+			break; /* the payload is larger than it should be */
+		curpos += stream.next_in - in;
+	} while (st == Z_OK || st == Z_BUF_ERROR);
+	git_inflate_end(&stream);
+	if ((st != Z_STREAM_END) || stream.total_out != size) {
+		free(buffer);
+		return NULL;
+	}
+
+	return buffer;
+}
+
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size);
 
-- 
2.13.1.518.g3df882009-goog



* [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (6 preceding siblings ...)
  2017-06-13 21:05 ` [PATCH v2 2/4] sha1_file: move delta base cache code up Jonathan Tan
@ 2017-06-13 21:05 ` Jonathan Tan
  2017-06-15 17:50   ` Junio C Hamano
  2017-06-13 21:06 ` [PATCH v2 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
                   ` (23 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-13 21:05 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object first in the packed store and then in the
loose store, and being able to mark a packed object as bad and then
retry the whole process. Consolidating these mechanisms would be a great
help to maintainability.

Therefore, consolidate all 3 functions by extending
sha1_object_info_extended() to support the functionality needed by all 3
functions, and then modifying the other 2 to use
sha1_object_info_extended().

Note that has_sha1_file_with_flags() does not try cached storage,
whereas the other 2 functions do - this functionality is preserved.
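
As a rough illustration of the consolidation (not the code in this
patch), the sketch below models a store with cached, packed and loose
objects as plain arrays. All toy_* names and the array-based store are
assumptions made for the example; only the two flag values are taken
from the patch. One lookup function owns the search order and the
rescan-and-retry step, and the reader and the existence check become
thin wrappers around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Flag values mirror the patch; everything named toy_* is hypothetical. */
#define TOY_INFO_QUICK 4
#define TOY_INFO_SKIP_CACHED 8

struct toy_object {
	int id;
	const char *data;
};

struct toy_store {
	const struct toy_object *cached;
	size_t nr_cached;
	const struct toy_object *packed;     /* packs already scanned */
	size_t nr_packed;
	const struct toy_object *loose;
	size_t nr_loose;
	const struct toy_object *new_packed; /* visible only after a rescan */
	size_t nr_new_packed;
	int rescanned;
};

struct toy_request {
	size_t *sizep;   /* optional */
	char **contentp; /* optional; the caller frees *contentp */
};

static const struct toy_object *toy_find(const struct toy_object *objs,
					 size_t nr, int id)
{
	size_t i;
	for (i = 0; i < nr; i++)
		if (objs[i].id == id)
			return &objs[i];
	return NULL;
}

/* Single storage-agnostic entry point; a NULL req means "existence only". */
static int toy_object_info(struct toy_store *s, int id,
			   struct toy_request *req, unsigned flags)
{
	const struct toy_object *o = NULL;

	assert(s);
	if (!(flags & TOY_INFO_SKIP_CACHED))
		o = toy_find(s->cached, s->nr_cached, id);
	if (!o)
		o = toy_find(s->packed, s->nr_packed, id);
	if (!o)
		o = toy_find(s->loose, s->nr_loose, id);
	if (!o && !(flags & TOY_INFO_QUICK)) {
		/* Someone else may have just packed it: rescan and retry. */
		s->rescanned = 1;
		o = toy_find(s->new_packed, s->nr_new_packed, id);
	}
	if (!o)
		return -1;
	if (!req)
		return 0;
	if (req->sizep)
		*req->sizep = strlen(o->data);
	if (req->contentp) {
		size_t len = strlen(o->data) + 1;
		*req->contentp = malloc(len);
		if (!*req->contentp)
			return -1;
		memcpy(*req->contentp, o->data, len);
	}
	return 0;
}

/* The reader and the existence check become thin wrappers. */
static char *toy_read_object(struct toy_store *s, int id, size_t *size)
{
	char *content = NULL;
	struct toy_request req = { size, &content };

	if (toy_object_info(s, id, &req, 0))
		return NULL;
	return content;
}

static int toy_has_object(struct toy_store *s, int id, int quick)
{
	unsigned flags = TOY_INFO_SKIP_CACHED |
			 (quick ? TOY_INFO_QUICK : 0);

	return !toy_object_info(s, id, NULL, flags);
}
```

In this model, any later change to the search order (such as a
fallback for objects that are neither loose nor packed) only has to be
made in toy_object_info().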

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |   7 +++
 sha1_file.c | 143 +++++++++++++++++++++++++++++++-----------------------------
 2 files changed, 81 insertions(+), 69 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e..3c85867c3 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,7 @@ struct object_info {
 	off_t *disk_sizep;
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
+	void **contentp;
 
 	/* Response */
 	enum {
@@ -1866,6 +1867,12 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/*
+ * sha1_object_info_extended() supports the LOOKUP_ flags and the OBJECT_INFO_
+ * flags.
+ */
+#define OBJECT_INFO_QUICK 4
+#define OBJECT_INFO_SKIP_CACHED 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a158907d1..98086e21e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-	int ret;
-	git_zstream stream;
-	char hdr[8192];
-
-	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-		return NULL;
-
-	return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
 				  struct pack_window **w_curs,
 			          off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	if (!ent)
 		return unpack_entry(p, base_offset, type, base_size);
 
-	*type = ent->type;
-	*base_size = ent->size;
+	if (type)
+		*type = ent->type;
+	if (base_size)
+		*base_size = ent->size;
 	return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 * We always get the representation type, but only convert it to
 	 * a "real" type later if the caller is interested.
 	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (oi->contentp) {
+		*oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+						      &type);
+		if (!*oi->contentp)
+			type = OBJ_BAD;
+	} else {
+		type = unpack_object_header(p, &w_curs, &curpos, &size);
+	}
 
-	if (oi->sizep) {
+	if (!oi->contentp && oi->sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2685,8 +2681,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 		free(external_base);
 	}
 
-	*final_type = type;
-	*final_size = size;
+	if (final_type)
+		*final_type = type;
+	if (final_size)
+		*final_size = size;
 
 	unuse_pack(&w_curs);
 
@@ -2920,6 +2918,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	git_zstream stream;
 	char hdr[32];
 	struct strbuf hdrbuf = STRBUF_INIT;
+	unsigned long size_scratch;
 
 	if (oi->delta_base_sha1)
 		hashclr(oi->delta_base_sha1);
@@ -2932,7 +2931,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * return value implicitly indicates whether the
 	 * object even exists.
 	 */
-	if (!oi->typep && !oi->typename && !oi->sizep) {
+	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
 		if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2945,6 +2944,10 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
+
+	if (!oi->sizep)
+		oi->sizep = &size_scratch;
+
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
 	if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
@@ -2962,50 +2965,71 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 				       sha1_to_hex(sha1));
 	} else if ((status = parse_sha1_header_extended(hdr, oi, flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
-	git_inflate_end(&stream);
+
+	if (status >= 0 && oi->contentp)
+		*oi->contentp = unpack_sha1_rest(&stream, hdr,
+						 *oi->sizep, sha1);
+	else
+		git_inflate_end(&stream);
+
 	munmap(map, mapsize);
 	if (status && oi->typep)
 		*oi->typep = status;
+	if (oi->sizep == &size_scratch)
+		oi->sizep = NULL;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
-	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
-	co = find_cached_object(real);
-	if (co) {
-		if (oi->typep)
-			*(oi->typep) = co->type;
-		if (oi->sizep)
-			*(oi->sizep) = co->size;
-		if (oi->disk_sizep)
-			*(oi->disk_sizep) = 0;
-		if (oi->delta_base_sha1)
-			hashclr(oi->delta_base_sha1);
-		if (oi->typename)
-			strbuf_addstr(oi->typename, typename(co->type));
-		oi->whence = OI_CACHED;
-		return 0;
+	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
+		struct cached_object *co = find_cached_object(real);
+		if (co) {
+			if (!oi)
+				return 0;
+			if (oi->typep)
+				*(oi->typep) = co->type;
+			if (oi->sizep)
+				*(oi->sizep) = co->size;
+			if (oi->disk_sizep)
+				*(oi->disk_sizep) = 0;
+			if (oi->delta_base_sha1)
+				hashclr(oi->delta_base_sha1);
+			if (oi->typename)
+				strbuf_addstr(oi->typename, typename(co->type));
+			if (oi->contentp)
+				*oi->contentp = xmemdupz(co->buf, co->size);
+			oi->whence = OI_CACHED;
+			return 0;
+		}
 	}
 
 	if (!find_pack_entry(real, &e)) {
 		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
+		if (oi && !sha1_loose_object_info(real, oi, flags)) {
 			oi->whence = OI_LOOSE;
 			return 0;
 		}
+		if (!oi && has_loose_object(real))
+			return 0;
 
 		/* Not a loose object; someone else may have just packed it. */
-		reprepare_packed_git();
-		if (!find_pack_entry(real, &e))
+		if (flags & OBJECT_INFO_QUICK) {
 			return -1;
+		} else {
+			reprepare_packed_git();
+			if (!find_pack_entry(real, &e))
+				return -1;
+		}
 	}
 
+	if (!oi)
+		return 0;
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
@@ -3081,28 +3105,15 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size)
 {
-	unsigned long mapsize;
-	void *map, *buf;
-	struct cached_object *co;
-
-	co = find_cached_object(sha1);
-	if (co) {
-		*type = co->type;
-		*size = co->size;
-		return xmemdupz(co->buf, co->size);
-	}
+	struct object_info oi = OBJECT_INFO_INIT;
+	void *content;
+	oi.typep = type;
+	oi.sizep = size;
+	oi.contentp = &content;
 
-	buf = read_packed_sha1(sha1, type, size);
-	if (buf)
-		return buf;
-	map = map_sha1_file(sha1, &mapsize);
-	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
-		munmap(map, mapsize);
-		return buf;
-	}
-	reprepare_packed_git();
-	return read_packed_sha1(sha1, type, size);
+	if (sha1_object_info_extended(sha1, &oi, 0))
+		return NULL;
+	return content;
 }
 
 /*
@@ -3480,18 +3491,12 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-	struct pack_entry e;
+	int f = OBJECT_INFO_SKIP_CACHED |
+		((flags & HAS_SHA1_QUICK) ? OBJECT_INFO_QUICK : 0);
 
 	if (!startup_info->have_repository)
 		return 0;
-	if (find_pack_entry(sha1, &e))
-		return 1;
-	if (has_loose_object(sha1))
-		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	return !sha1_object_info_extended(sha1, NULL, f);
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.518.g3df882009-goog



* [PATCH v2 4/4] sha1_file, fsck: add missing blob support
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (7 preceding siblings ...)
  2017-06-13 21:05 ` [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
@ 2017-06-13 21:06 ` Jonathan Tan
  2017-06-15 18:34   ` Junio C Hamano
  2017-06-15 20:39 ` [PATCH v3 0/4] Improvements to sha1_file Jonathan Tan
                   ` (22 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-13 21:06 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

Currently, Git does not cope well with repos that have very large
numbers of blobs, or with repos that wish to minimize manipulation of
certain blobs (for example, because they are very large), even if the
user operates mostly on part of the repo. This is because Git is
designed on the assumption that every blob referenced by a tree object
is available somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.
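
The manifest format is not documented yet. Going by has_missing_blob()
in the diff and by the test script, a manifest file appears to be an
8-byte entry count in network byte order followed by fixed-width
entries of a 20-byte SHA-1 and an 8-byte size in network byte order.
The use of sha1_entry_pos() implies that entries are sorted by hash.
Under those assumptions, a standalone lookup could be sketched as
below; the toy_* names and the bounds check are mine, not part of this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_RAWSZ 20                 /* raw SHA-1 length */
#define TOY_ENTRYSZ (TOY_RAWSZ + 8)  /* hash followed by a 64-bit size */

/* Read a 64-bit big-endian (network byte order) integer. */
static uint64_t toy_get_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Binary-search an in-memory manifest for "hash". Returns 1 if it is
 * listed (storing its recorded size if "size" is not NULL), else 0.
 */
static int toy_manifest_lookup(const unsigned char *data, size_t len,
			       const unsigned char *hash, uint64_t *size)
{
	uint64_t nr, lo = 0, hi;

	assert(data && hash);
	if (len < 8)
		return 0;
	nr = toy_get_be64(data);
	if (nr > (len - 8) / TOY_ENTRYSZ)
		return 0; /* truncated or corrupt manifest */
	hi = nr;
	while (lo < hi) {
		uint64_t mid = lo + (hi - lo) / 2;
		const unsigned char *entry = data + 8 + mid * TOY_ENTRYSZ;
		int cmp = memcmp(hash, entry, TOY_RAWSZ);

		if (!cmp) {
			if (size)
				*size = toy_get_be64(entry + TOY_RAWSZ);
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

Fixed-width sorted entries keep the lookup logarithmic in the number
of registered blobs without needing a separate index.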

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the caller
     needing to know how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().
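
The control flow of that modification is a retry-once fallback. The
sketch below reduces it to a toy model: the toy_repo structure, the
function-pointer hook and the fixed table of 8 ids are all assumptions
made for illustration, and where the patch die()s on hook failure the
sketch just returns -1.

```c
#include <assert.h>
#include <stddef.h>

struct toy_repo {
	int have[8];     /* have[id] != 0: object is stored locally */
	int promised[8]; /* promised[id] != 0: listed in a manifest */
	/* stand-in for the configured command; returns 0 on success */
	int (*fetch)(struct toy_repo *repo, int id);
	int fetch_calls;
};

/* Example hook that stores the requested object. */
static int toy_fetch_ok(struct toy_repo *repo, int id)
{
	repo->have[id] = 1;
	return 0;
}

/* Example hook that claims success but stores nothing. */
static int toy_fetch_noop(struct toy_repo *repo, int id)
{
	(void)repo;
	(void)id;
	return 0;
}

/* Returns 0 if the object is (or becomes) available, else -1. */
static int toy_lookup(struct toy_repo *repo, int id)
{
	int already_retried = 0;

	assert(repo && id >= 0 && id < 8);
retry:
	if (repo->have[id])
		return 0;

	/* Not stored locally: consult the hook at most once. */
	if (!already_retried && repo->fetch && repo->promised[id]) {
		already_retried = 1;
		repo->fetch_calls++;
		if (repo->fetch(repo, id))
			return -1;
		goto retry;
	}
	return -1;
}
```

The already_retried guard is what keeps a hook that reports success
without writing the object from causing an endless loop.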

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Invoking the hook only when a packed blob is requested, not for any
   blob. That is, whenever we search the packfiles for a blob and it is
   missing (from the packfiles and from the loose object storage),
   invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or that sha1_file pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large number of blobs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt |  10 +++
 builtin/fsck.c           |   7 +++
 cache.h                  |   6 ++
 sha1_file.c              | 156 ++++++++++++++++++++++++++++++++++++++++++-----
 t/t3907-missing-blob.sh  |  69 +++++++++++++++++++++
 5 files changed, 233 insertions(+), 15 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+	If set, whenever a blob in the local repo is attempted to be
+	read but is missing, invoke this shell command to generate or
+	obtain that blob before reporting an error. This shell command
+	should take one or more hashes, each terminated by a newline, as
+	standard input, and (if successful) should write the
+	corresponding objects to the local repo (packed or loose).
++
+If set, fsck will not treat a missing blob as an error condition.
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/builtin/fsck.c b/builtin/fsck.c
index cb2ba6cd1..6f10d6034 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -37,6 +37,7 @@ static int verbose;
 static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
+static int missing_blob_ok;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -93,6 +94,9 @@ static int fsck_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.missingblobcommand"))
+		missing_blob_ok = 1;
+
 	return git_default_config(var, value, cb);
 }
 
@@ -222,6 +226,9 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
+		if (missing_blob_ok && obj->type == OBJ_BLOB &&
+		    has_missing_blob(obj->oid.hash, NULL))
+			return;
 		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
diff --git a/cache.h b/cache.h
index 3c85867c3..2853b39c4 100644
--- a/cache.h
+++ b/cache.h
@@ -1876,6 +1876,12 @@ struct object_info {
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
+/*
+ * Returns 1 if sha1 is the hash of a known missing blob. If size is not NULL,
+ * also returns its size.
+ */
+extern int has_missing_blob(const unsigned char *sha1, unsigned long *size);
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/sha1_file.c b/sha1_file.c
index 98086e21e..75fe2174d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -27,6 +27,9 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -1624,6 +1627,72 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 	return NULL;
 }
 
+struct missing_blob_manifest {
+	struct missing_blob_manifest *next;
+	const char *data;
+};
+struct missing_blob_manifest *missing_blobs;
+int missing_blobs_initialized;
+
+static void prepare_missing_blobs(void)
+{
+	int ok;
+	char *dirname;
+	struct dir_iterator *iter;
+
+	if (missing_blobs_initialized)
+		return;
+
+	missing_blobs_initialized = 1;
+
+	dirname = xstrfmt("%s/missing", get_object_directory());
+	iter = dir_iterator_begin(dirname);
+
+	while ((ok = dir_iterator_advance(iter)) == ITER_OK) {
+		int fd;
+		const char *data;
+		struct missing_blob_manifest *m;
+		if (!S_ISREG(iter->st.st_mode))
+			continue;
+		fd = git_open(iter->path.buf);
+		data = xmmap(NULL, iter->st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		close(fd);
+
+		m = xmalloc(sizeof(*m));
+		m->next = missing_blobs;
+		m->data = data;
+		missing_blobs = m;
+	}
+
+	if (ok != ITER_DONE) {
+		/* do something */
+	}
+
+	free(dirname);
+}
+
+int has_missing_blob(const unsigned char *sha1, unsigned long *size)
+{
+	struct missing_blob_manifest *m;
+	prepare_missing_blobs();
+	for (m = missing_blobs; m; m = m->next) {
+		uint64_t nr_nbo, nr;
+		int result;
+		memcpy(&nr_nbo, m->data, sizeof(nr_nbo));
+		nr = ntohll(nr_nbo);
+		result = sha1_entry_pos(m->data, GIT_SHA1_RAWSZ + 8, 8, 0, nr, nr, sha1);
+		if (result >= 0) {
+			if (size) {
+				uint64_t size_nbo;
+				memcpy(&size_nbo, m->data + 8 + result * (GIT_SHA1_RAWSZ + 8) + GIT_SHA1_RAWSZ, sizeof(size_nbo));
+				*size = ntohll(size_nbo);
+			}
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /*
  * With an in-core object data in "map", rehash it to make sure the
  * object name actually matches "sha1" to detect object corruption.
@@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	return (status < 0) ? status : 0;
 }
 
+static char *missing_blob_command;
+static int sha1_file_config(const char *conf_key, const char *value, void *cb)
+{
+	if (!strcmp(conf_key, "core.missingblobcommand")) {
+		missing_blob_command = xstrdup(value);
+	}
+	return 0;
+}
+
+static int configured;
+static void ensure_configured(void)
+{
+	if (configured)
+		return;
+
+	git_config(sha1_file_config, NULL);
+	configured = 1;
+}
+
+static void handle_missing_blob(const unsigned char *sha1)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	const char *argv[] = {missing_blob_command, NULL};
+	char input[GIT_MAX_HEXSZ + 1];
+
+	memcpy(input, sha1_to_hex(sha1), 40);
+	input[40] = '\n';
+
+	cp.argv = argv;
+	cp.env = local_repo_env;
+	cp.use_shell = 1;
+
+	if (pipe_command(&cp, input, sizeof(input), NULL, 0, NULL, 0)) {
+		die("failed to load blob %s", sha1_to_hex(sha1));
+	}
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+}
+
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	int already_retried = 0;
 
 	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
 		struct cached_object *co = find_cached_object(real);
@@ -3009,25 +3122,38 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (oi && !sha1_loose_object_info(real, oi, flags)) {
-			oi->whence = OI_LOOSE;
-			return 0;
-		}
-		if (!oi && has_loose_object(real))
-			return 0;
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
-			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
+	/* Most likely it's a loose object. */
+	if (oi && !sha1_loose_object_info(real, oi, flags)) {
+		oi->whence = OI_LOOSE;
+		return 0;
+	}
+	if (!oi && has_loose_object(real))
+		return 0;
+
+	/* Not a loose object; someone else may have just packed it. */
+	if (!(flags & OBJECT_INFO_QUICK)) {
+		reprepare_packed_git();
+		if (find_pack_entry(real, &e))
+			goto found_packed;
+	}
+
+	/* Try the missing blobs */
+	if (!already_retried) {
+		ensure_configured();
+		if (missing_blob_command && has_missing_blob(real, NULL)) {
+			already_retried = 1;
+			handle_missing_blob(real);
+			goto retry;
 		}
 	}
 
+	return -1;
+
+found_packed:
 	if (!oi)
 		return 0;
 	rtype = packed_object_info(e.p, e.offset, oi);
diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
new file mode 100755
index 000000000..e0ce0942d
--- /dev/null
+++ b/t/t3907-missing-blob.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='core.missingblobcommand option'
+
+. ./test-lib.sh
+
+pack() {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
+
+test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+
+	# does not work if missing blob is not registered
+	test_must_fail git -C client cat-file -p "$HASH" &&
+
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+
+	# works when missing blob is registered
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'has_sha1_file (through git cat-file -e)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	git -C client cat-file -e "$HASH"
+'
+
+test_expect_success 'fsck' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+	echo hash is $HASH &&
+
+	cp -r server client &&
+	test_config -C client core.missingblobcommand "this-command-is-not-actually-run" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	rm client/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&
+	git -C client fsck
+'
+
+test_done
-- 
2.13.1.518.g3df882009-goog



* Re: [PATCH v2 2/4] sha1_file: move delta base cache code up
  2017-06-13 21:05 ` [PATCH v2 2/4] sha1_file: move delta base cache code up Jonathan Tan
@ 2017-06-15 17:00   ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-15 17:00 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> In a subsequent patch, packed_object_info() will be modified to use the
> delta base cache, so move the relevant code to before
> packed_object_info().
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  sha1_file.c | 226 +++++++++++++++++++++++++++++++-----------------------------
>  1 file changed, 116 insertions(+), 110 deletions(-)

Hmph, is this meant to be just moving two whole functions?

> diff --git a/sha1_file.c b/sha1_file.c
> index a52b27541..a158907d1 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2239,116 +2239,6 @@ static enum ...
> ...
> -int packed_object_info(struct packed_git *p, off_t obj_offset,
> -		       struct object_info *oi)
> -{
> -...
> -	if (oi->delta_base_sha1) {
> -		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
> -			const unsigned char *base;
> -
> -			base = get_delta_base_sha1(p, &w_curs, curpos,
> -						   type, obj_offset);
> -			if (!base) {
> -				type = OBJ_BAD;
> -				goto out;
> -			}
> -
> -			hashcpy(oi->delta_base_sha1, base);
> -		} else
> -			hashclr(oi->delta_base_sha1);
> -	}
> -
> -out:
> -	unuse_pack(&w_curs);
> -	return type;
> -}
> -...

The above is what was removed, while ...

> @@ -2486,6 +2376,122 @@ static void ...
> ...
> +int packed_object_info(struct packed_git *p, off_t obj_offset,
> +		       struct object_info *oi)
> +{
> +...
> +	if (oi->delta_base_sha1) {
> +		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
> +			const unsigned char *base;
> +
> +			base = get_delta_base_sha1(p, &w_curs, curpos,
> +						   type, obj_offset);
> +			if (!base) {
> +				type = OBJ_BAD;
> +				goto out;
> +			}
> +
> +			hashcpy(oi->delta_base_sha1, base);
> +		} else
> +			hashclr(oi->delta_base_sha1);
> +	}
> +
> +	oi->whence = OI_PACKED;
> +	oi->u.packed.offset = obj_offset;
> +	oi->u.packed.pack = p;
> +	oi->u.packed.is_delta = (type == OBJ_REF_DELTA ||
> +				 type == OBJ_OFS_DELTA);
> +
> +out:
> +	unuse_pack(&w_curs);
> +	return type;
> +}

... we somehow gained code to update *oi that used to be (and still
is) done by its sole caller, sha1_object_info_extended().

Perhaps this is a rebase-gotcha?


* Re: [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-13 21:05 ` [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
@ 2017-06-15 17:50   ` Junio C Hamano
  2017-06-15 18:14     ` Jonathan Tan
  2017-06-17 12:19     ` Jeff King
  0 siblings, 2 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-15 17:50 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Looking at the 3 primary functions (sha1_object_info_extended,
> read_object, has_sha1_file_with_flags), they independently implement
> mechanisms such as object replacement, retrying the packed store after
> failing to find the object in the packed store then the loose store, and
> being able to mark a packed object as bad and then retrying the whole
> process. Consolidating these mechanisms would be a great help to
> maintainability.
>
> Therefore, consolidate all 3 functions by extending
> sha1_object_info_extended() to support the functionality needed by all 3
> functions, and then modifying the other 2 to use
> sha1_object_info_extended().

This is a rather "ugly" looking patch ;-) but I followed what
has_sha1_file_with_flags() and read_object() do before and after
this change, and I think this patch is a no-op wrt their behaviour
(which is a good thing).

But I have very mixed feelings about one aspect of the resulting
sha1_object_info_extended().

>  int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
>  {
> ...
>  	if (!find_pack_entry(real, &e)) {
>  		/* Most likely it's a loose object. */
> -		if (!sha1_loose_object_info(real, oi, flags)) {
> +		if (oi && !sha1_loose_object_info(real, oi, flags)) {
>  			oi->whence = OI_LOOSE;
>  			return 0;
>  		}
> +		if (!oi && has_loose_object(real))
> +			return 0;

This conversion is not incorrect per-se.  

We can see that has_sha1_file_with_flags() after this patch still
calls has_loose_object().  But it bothers me that there is no hint
to future developers to warn that a rewrite of the above like this
is incorrect:

        if (!find_pack_entry(real, &e)) {
                /* Most likely it's a loose object. */
       +        struct object_info dummy_oi;
       +        if (!oi) {
       +                memset(&dummy_oi, 0, sizeof(dummy_oi));
       +                oi = &dummy_oi;
       +        }
       -        if (oi && !sha1_loose_object_info(real, oi, flags)) {
       +        if (!sha1_loose_object_info(real, oi, flags)) {
                        oi->whence = OI_LOOSE;
                        return 0;
                }
       -        if (!oi && has_loose_object(real))
       -                return 0;

It used to be very easy to see that has_sha1_file_with_flags() will
call has_loose_object() when it does not find the object in a pack,
which will result in the loose object file being freshened.  In the
new code, that is much harder to see: it happens only when the caller
passes oi == NULL.  has_sha1_file_with_flags() is such a caller, but
it is unclear whether there are other callers of this "consolidated"
sha1_object_info_extended() that pass oi == NULL, and if so, whether
they also want to freshen the loose object file.
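
The concern above can be illustrated with a toy model. This is only a
sketch: the struct, the clock and both function names are hypothetical
stand-ins, and it takes the description above at face value (an
existence check that freshens as a side effect) without verifying what
git's real has_loose_object() does.

```c
#include <stddef.h>

/* Hypothetical stand-in for a loose object file; not git's real data. */
struct toy_loose_object {
	int exists;
	long mtime;
};

/* Hypothetical "current time" used when freshening. */
static long toy_clock = 100;

/* Existence check that also freshens, as described in the text above. */
static int toy_has_loose_object(struct toy_loose_object *obj)
{
	if (!obj->exists)
		return 0;
	obj->mtime = toy_clock;	/* the side effect that is easy to lose */
	return 1;
}

/* Pure lookup: reports on the object but never touches its mtime. */
static int toy_loose_object_info(const struct toy_loose_object *obj,
				 long *mtime_out)
{
	if (!obj->exists)
		return -1;
	if (mtime_out)
		*mtime_out = obj->mtime;
	return 0;
}
```

Both functions agree on whether the object exists, so swapping one for
the other passes any test that only looks at the return value; only the
mtime reveals the difference.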

> @@ -3480,18 +3491,12 @@ int has_sha1_pack(const unsigned char *sha1)
>  
>  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
>  {
> -	struct pack_entry e;
> +	int f = OBJECT_INFO_SKIP_CACHED |
> +		((flags & HAS_SHA1_QUICK) ? OBJECT_INFO_QUICK : 0);
>  
>  	if (!startup_info->have_repository)
>  		return 0;
> -	if (find_pack_entry(sha1, &e))
> -		return 1;
> -	if (has_loose_object(sha1))
> -		return 1;
> -	if (flags & HAS_SHA1_QUICK)
> -		return 0;
> -	reprepare_packed_git();
> -	return find_pack_entry(sha1, &e);
> +	return !sha1_object_info_extended(sha1, NULL, f);
>  }

I would have preferred to see the new variable not to be called 'f',
as that makes it unclear what it is (is that a callback function
pointer?).  In this case, you are forcing the flag bits passed
down, so perhaps you can reuse the same variable?  

If you allocated a separate variable because
has_sha1_file_with_flags() and sha1_object_info_extended() take flag
bits from two separate vocabularies, that is valid reasoning, but
in that case I would have named 'f' to reflect the fact that it is
not the 'flags' parameter defined in the has_sha1_file_with_flags()
world but a different thing defined in the
sha1_object_info_extended() world, e.g. "soie_flag" or something
like that.
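
A minimal sketch of the "two vocabularies" point, assuming made-up bit
values (the real definitions live in git's headers and are not shown in
this thread). The helper name to_soie_flags() is hypothetical; it only
shows the translation being kept in a variable whose name says which
world it belongs to.

```c
/* Hypothetical bit values, chosen only for this illustration. */
#define HAS_SHA1_QUICK          0x1
#define OBJECT_INFO_SKIP_CACHED 0x2
#define OBJECT_INFO_QUICK       0x4

/*
 * Translate flags from the has_sha1_file_with_flags() vocabulary into
 * the sha1_object_info_extended() vocabulary.
 */
static unsigned to_soie_flags(int has_sha1_flags)
{
	unsigned soie_flags = OBJECT_INFO_SKIP_CACHED;

	if (has_sha1_flags & HAS_SHA1_QUICK)
		soie_flags |= OBJECT_INFO_QUICK;
	return soie_flags;
}
```

Because the two sets of bits need not share values, passing the
caller's flags straight through would only work by accident.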

Thanks.


* Re: [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-15 17:50   ` Junio C Hamano
@ 2017-06-15 18:14     ` Jonathan Tan
  2017-06-17 12:19     ` Jeff King
  1 sibling, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 18:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Thu, 15 Jun 2017 10:50:46 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Looking at the 3 primary functions (sha1_object_info_extended,
> > read_object, has_sha1_file_with_flags), they independently implement
> > mechanisms such as object replacement, retrying the packed store after
> > failing to find the object in the packed store then the loose store, and
> > being able to mark a packed object as bad and then retrying the whole
> > process. Consolidating these mechanisms would be a great help to
> > maintainability.
> >
> > Therefore, consolidate all 3 functions by extending
> > sha1_object_info_extended() to support the functionality needed by all 3
> > functions, and then modifying the other 2 to use
> > sha1_object_info_extended().
> 
> This is a rather "ugly" looking patch ;-) but I followed what
> has_sha1_file_with_flags() and read_object() do before and after
> this change, and I think this patch is a no-op wrt their behaviour
> (which is a good thing).
> 
> But I have very mixed feelings about one aspect of the resulting
> sha1_object_info_extended().
> 
> >  int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
> >  {
> > ...
> >  	if (!find_pack_entry(real, &e)) {
> >  		/* Most likely it's a loose object. */
> > -		if (!sha1_loose_object_info(real, oi, flags)) {
> > +		if (oi && !sha1_loose_object_info(real, oi, flags)) {
> >  			oi->whence = OI_LOOSE;
> >  			return 0;
> >  		}
> > +		if (!oi && has_loose_object(real))
> > +			return 0;
> 
> This conversion is not incorrect per-se.  
> 
> We can see that has_sha1_file_with_flags() after this patch still
> calls has_loose_object().  But it bothers me that there is no hint
> to future developers to warn that a rewrite of the above like this
> is incorrect:
> 
>         if (!find_pack_entry(real, &e)) {
>                 /* Most likely it's a loose object. */
>        +        struct object_info dummy_oi;
>        +        if (!oi) {
>        +                memset(&dummy_oi, 0, sizeof(dummy_oi));
>        +                oi = &dummy_oi;
>        +        }
>        -        if (oi && !sha1_loose_object_info(real, oi, flags)) {
>        +        if (!sha1_loose_object_info(real, oi, flags)) {
>                         oi->whence = OI_LOOSE;
>                         return 0;
>                 }
>        -        if (!oi && has_loose_object(real))
>        -                return 0;
> 
> It used to be very easy to see that has_sha1_file_with_flags() will
> call has_loose_object() when it does not find the object in a pack,
> which will result in the loose object file being freshened.  In the
> new code, that is much harder to see: it happens only when the caller
> passes oi == NULL.  has_sha1_file_with_flags() is such a caller, but
> it is unclear whether there are other callers of this "consolidated"
> sha1_object_info_extended() that pass oi == NULL, and if so, whether
> they also want to freshen the loose object file.

Good point - sorry, I didn't pay much attention to the freshening
behavior. After some thought, I now think it might be better to avoid
modifying has_sha1_file_with_flags(). As it is,
sha1_object_info_extended() already needs special handling (special
flags and handling the possibility of "oi" being NULL) to handle the
functionality required by has_sha1_file_with_flags(); adding yet another
thing to handle (freshen or not) would make it much too complicated.

This means that subsequent patches that modify the handling of
storage-agnostic objects would still need to modify 2 functions, but at
least that is fewer than the current 3.

I'll reroll with these changes so that you (and others) can see what it
looks like.

> > @@ -3480,18 +3491,12 @@ int has_sha1_pack(const unsigned char *sha1)
> >  
> >  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
> >  {
> > -	struct pack_entry e;
> > +	int f = OBJECT_INFO_SKIP_CACHED |
> > +		((flags & HAS_SHA1_QUICK) ? OBJECT_INFO_QUICK : 0);
> >  
> >  	if (!startup_info->have_repository)
> >  		return 0;
> > -	if (find_pack_entry(sha1, &e))
> > -		return 1;
> > -	if (has_loose_object(sha1))
> > -		return 1;
> > -	if (flags & HAS_SHA1_QUICK)
> > -		return 0;
> > -	reprepare_packed_git();
> > -	return find_pack_entry(sha1, &e);
> > +	return !sha1_object_info_extended(sha1, NULL, f);
> >  }
> 
> I would have preferred to see the new variable not to be called 'f',
> as that makes it unclear what it is (is that a callback function
> pointer?).  In this case, you are forcing the flag bits passed
> down, so perhaps you can reuse the same variable?  
> 
> If you allocated a separate variable because
> has_sha1_file_with_flags() and sha1_object_info_extended() take flag
> bits from two separate vocabularies, that is valid reasoning, but
> in that case I would have named 'f' to reflect the fact that it is
> not the 'flags' parameter defined in the has_sha1_file_with_flags()
> world but a different thing defined in the
> sha1_object_info_extended() world, e.g. "soie_flag" or something
> like that.
> 
> Thanks.

This makes sense. If I don't end up reverting
has_sha1_file_with_flags(), I'll change the name to "soie_flag".


* Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support
  2017-06-13 21:06 ` [PATCH v2 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
@ 2017-06-15 18:34   ` Junio C Hamano
  2017-06-15 20:31     ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-06-15 18:34 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> diff --git a/sha1_file.c b/sha1_file.c
> index 98086e21e..75fe2174d 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -27,6 +27,9 @@
>  #include "list.h"
>  #include "mergesort.h"
>  #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
>  
>  #define SZ_FMT PRIuMAX
>  static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -1624,6 +1627,72 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
>  	return NULL;
>  }
>  
> +struct missing_blob_manifest {
> +	struct missing_blob_manifest *next;
> +	const char *data;
> +};
> +struct missing_blob_manifest *missing_blobs;
> +int missing_blobs_initialized;

I do not think you meant to make these non-static.  The type of the
former is not even visible to the outside world, and the latter is
something that could be made into static to prepare_missing_blobs()
function (unless and until you start allowing the missing-blobs
manifest to be re-initialized).  Your ensure_configured() below
seems to do the "static" right, on the other hand ;-).

Do we expect that we will have only a handful of these missing blob
manifests?  Each manifest seems to be efficiently looked-up with a
binary search, but it makes me wonder if it is a good idea to
consolidate these manifests into a single list of object names to
eliminate the outer loop in has_missing_blob().  Unlike pack .idx
files that must stay one-to-one with .pack files, it appears to me
that there is no reason why we need to keep multiple ones separate
for an extended period of time (e.g. whenever we learn that we received
an incomplete pack from the other side with a list of newly missing
blobs, we could incorporate that into the existing missing blob list).
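
A minimal sketch of that consolidation, assuming each manifest is
already a sorted, duplicate-free array of raw 20-byte object names.
The names merge_manifests() and in_manifest() are hypothetical, and the
per-blob size field seen in the test's manifest is left out for
brevity.

```c
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 20	/* raw SHA-1 length in bytes */

static int cmp_name(const void *a, const void *b)
{
	return memcmp(a, b, NAME_LEN);
}

/*
 * Merge two sorted arrays of object names into "out", dropping names
 * present in both.  "out" must have room for na + nb names.  Returns
 * the number of names written.
 */
static size_t merge_manifests(const unsigned char *a, size_t na,
			      const unsigned char *b, size_t nb,
			      unsigned char *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		const unsigned char *pick;
		int c;

		if (i == na)
			c = 1;		/* only "b" has names left */
		else if (j == nb)
			c = -1;		/* only "a" has names left */
		else
			c = memcmp(a + i * NAME_LEN, b + j * NAME_LEN,
				   NAME_LEN);

		if (c < 0) {
			pick = a + i * NAME_LEN;
			i++;
		} else if (c > 0) {
			pick = b + j * NAME_LEN;
			j++;
		} else {
			pick = a + i * NAME_LEN;
			i++;
			j++;		/* same name in both: keep one */
		}
		memcpy(out + n * NAME_LEN, pick, NAME_LEN);
		n++;
	}
	return n;
}

/* One binary search over the consolidated list; no outer loop. */
static int in_manifest(const unsigned char *names, size_t n,
		       const unsigned char *name)
{
	return bsearch(name, names, n, NAME_LEN, cmp_name) != NULL;
}
```

The merge is linear in the size of both lists, which is the cost a
fetch would pay up front in exchange for a single lookup per object
later.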

> +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
> +{

This function that answers "is it expected to be missing?" is
confusingly named.  Is it missing, or does it exist?

> @@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>  	return (status < 0) ? status : 0;
>  }
>  
> +static char *missing_blob_command;
> +static int sha1_file_config(const char *conf_key, const char *value, void *cb)
> +{
> +	if (!strcmp(conf_key, "core.missingblobcommand")) {
> +		missing_blob_command = xstrdup(value);
> +	}
> +	return 0;
> +}
> +
> +static int configured;
> +static void ensure_configured(void)
> +{
> +	if (configured)
> +		return;

Do not be selfish and pretend that this is the _only_ kind of
configuration that needs to be done inside sha1_file.c.  Call the
function ensure_<something>_is_configured() and rename the run-once
guard to match.

The run-once guard can be made static to the "ensure" function, and
if you do so, its name can stay "configured", as at that point it is
clear what it is guarding.
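
A minimal sketch of that shape, assuming a hypothetical loader in place
of the real git_config(sha1_file_config, NULL) call. The names
ensure_missing_blob_is_configured() and load_missing_blob_config(), the
placeholder command string and the config_reads counter are all
illustrative only.

```c
#include <stddef.h>

static const char *missing_blob_command;
static int config_reads;	/* counts loads; for demonstration only */

/* Hypothetical stand-in for reading core.missingblobcommand. */
static void load_missing_blob_config(void)
{
	missing_blob_command = "fetch-blob-helper";	/* placeholder value */
	config_reads++;
}

static void ensure_missing_blob_is_configured(void)
{
	/* Function-local guard: nothing else in the file can see it. */
	static int configured;

	if (configured)
		return;
	load_missing_blob_config();
	configured = 1;
}
```

With the guard scoped to the function, the generic name "configured"
cannot collide with a guard for some other feature in the same file.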

> diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
> new file mode 100755
> index 000000000..e0ce0942d
> --- /dev/null
> +++ b/t/t3907-missing-blob.sh
> @@ -0,0 +1,69 @@
> +#!/bin/sh
> +
> +test_description='core.missingblobcommand option'
> +
> +. ./test-lib.sh
> +
> +pack() {

Style: "pack () {"

> +	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'

high-nybble first to match ntohll() done in has_missing_blob()?  OK.

> +}
> +
> +test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
> +	rm -rf server client &&
> +
> +	git init server &&
> +	test_commit -C server 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	HASH=$(git hash-object server/1.t) &&
> +
> +	git init client &&
> +	test_config -C client core.missingblobcommand \
> +		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
> +
> +	# does not work if missing blob is not registered
> +	test_must_fail git -C client cat-file -p "$HASH" &&
> +
> +	mkdir -p client/.git/objects/missing &&
> +	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
> +		pack >client/.git/objects/missing/x &&
> +
> +	# works when missing blob is registered
> +	git -C client cat-file -p "$HASH"
> +'

OK, by passing printf '%016x', implementations of "$(wc -c)" that
gives extra whitespace around its output can still work correctly.
Good.


* Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support
  2017-06-15 18:34   ` Junio C Hamano
@ 2017-06-15 20:31     ` Jonathan Tan
  2017-06-15 20:52       ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

A reroll is coming soon, but there is an interesting discussion point
here so I'll reply to this e-mail first.

On Thu, 15 Jun 2017 11:34:45 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > +struct missing_blob_manifest {
> > +	struct missing_blob_manifest *next;
> > +	const char *data;
> > +};
> > +struct missing_blob_manifest *missing_blobs;
> > +int missing_blobs_initialized;
> 
> I do not think you meant to make these non-static.  The type of the
> former is not even visible to the outside world, and the latter is
> something that could be made into static to prepare_missing_blobs()
> function (unless and until you start allowing the missing-blobs
> manifest to be re-initialized).  Your ensure_configured() below
> seems to do the "static" right, on the other hand ;-).

Good catch - done.

> Do we expect that we will have only a handful of these missing blob
> manifests?  Each manifest seems to be efficiently looked-up with a
> binary search, but it makes me wonder if it is a good idea to
> consolidate these manifests into a single list of object names to
> eliminate the outer loop in has_missing_blob().  Unlike pack .idx
> files that must stay one-to-one with .pack files, it appears to me
> that there is no reason why we need to keep multiple ones separate
> for an extended period of time (e.g. whenever we learn that we received
> an incomplete pack from the other side with a list of newly missing
> blobs, we could incorporate that into the existing missing blob list).

There is indeed no reason why we need to keep multiple ones separate for
an extended period of time - my thinking was to let fetch/clone be fast
by not needing to scan through the entire existing manifest (in order to
create the new one), letting GC take care of consolidating them (since
it would have to check individual entries to delete those corresponding
to objects that have entered the repo through other means). But this is
at the expense of making the individual object lookups a bit slower.

For now, I'll leave the possibility of multiple files open while I try
to create a set of patches that can implement missing blob support from
fetch to day-to-day usage. But I am not opposed to changing it to a
single-file manifest.

> > +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
> > +{
> 
> This function that answers "is it expected to be missing?" is
> confusingly named.  Is it missing, or does it exist?

Renamed to in_missing_blob_manifest().

> > @@ -2981,11 +3050,55 @@ static int sha1_loose_object_info(const unsigned char *sha1,
> >  	return (status < 0) ? status : 0;
> >  }
> >  
> > +static char *missing_blob_command;
> > +static int sha1_file_config(const char *conf_key, const char *value, void *cb)
> > +{
> > +	if (!strcmp(conf_key, "core.missingblobcommand")) {
> > +		missing_blob_command = xstrdup(value);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static int configured;
> > +static void ensure_configured(void)
> > +{
> > +	if (configured)
> > +		return;
> 
> Do not be selfish and pretend that this is the _only_ kind of
> configuration that needs to be done inside sha1_file.c.  Call the
> function ensure_<something>_is_configured() and rename the run-once
> guard to match.

My thinking was that any additional configuration could be added to this
function, but individual configuration for each feature is fine too. I
have renamed things according to your suggestion.

> The run-once guard can be made static to the "ensure" function, and
> if you do so, its name can stay "configured", as at that point it is
> clear what it is guarding.

Done.

> > +pack() {
> 
> Style: "pack () {"

Done.

> 
> > +	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
> 
> high-nybble first to match ntohll() done in has_missing_blob()?  OK.

Actually it's to match the printf behavior below that prints the high
nybble first (like in English).
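
A minimal sketch of what "high nybble first" means for the test's
pack() helper: each pair of hex digits becomes one byte, with the first
digit in the upper four bits. The function names below are
hypothetical, and unlike Perl's pack("H*") this sketch rejects
odd-length or non-hex input instead of padding it.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if "c" is not a hex digit. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Convert "hexlen" hex digits into hexlen / 2 bytes, high nybble
 * first.  Returns the number of bytes written, or -1 on bad input.
 */
static long pack_hex(const char *hex, size_t hexlen, unsigned char *out)
{
	size_t i;

	if (hexlen % 2)
		return -1;
	for (i = 0; i < hexlen; i += 2) {
		int hi = hexval(hex[i]);
		int lo = hexval(hex[i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i / 2] = (unsigned char)((hi << 4) | lo);
	}
	return (long)(hexlen / 2);
}
```

Fed the output of printf "%016x" 1, this yields seven zero bytes
followed by 0x01, i.e. the value 1 as a big-endian 64-bit integer.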


* [PATCH v3 0/4] Improvements to sha1_file
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (8 preceding siblings ...)
  2017-06-13 21:06 ` [PATCH v2 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
@ 2017-06-15 20:39 ` Jonathan Tan
  2017-06-15 20:39 ` [PATCH v3 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Thanks - this has been updated following Junio's comments.

Patch 1 is unmodified from the previous version.

Patch 2 has been modified to remove the extraneous code pointed out by
Junio. I had previously planned to populate those fields in
packed_object_info() and later changed my mind, but a rebase went
wrong.

Patches 3-4 have been updated as I have described in [1] and [2].

[1] https://public-inbox.org/git/20170615111447.1208e02b@twelve2.svl.corp.google.com/
[2] https://public-inbox.org/git/20170615111447.1208e02b@twelve2.svl.corp.google.com/

As before, I would like review on patches 1-3 to go into the tree.
(Patch 4 is a work in progress, and is here just to demonstrate the
effectiveness of the refactoring.)

Jonathan Tan (4):
  sha1_file: teach packed_object_info about typename
  sha1_file: move delta base cache code up
  sha1_file: consolidate storage-agnostic object fns
  sha1_file, fsck: add missing blob support

 Documentation/config.txt |  10 +
 builtin/fsck.c           |   7 +
 cache.h                  |   8 +
 sha1_file.c              | 474 ++++++++++++++++++++++++++++++-----------------
 t/t3907-missing-blob.sh  |  69 +++++++
 5 files changed, 400 insertions(+), 168 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

-- 
2.13.1.518.g3df882009-goog



* [PATCH v3 1/4] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (9 preceding siblings ...)
  2017-06-15 20:39 ` [PATCH v3 0/4] Improvements to sha1_file Jonathan Tan
@ 2017-06-15 20:39 ` Jonathan Tan
  2017-06-15 20:39 ` [PATCH v3 2/4] sha1_file: move delta base cache code up Jonathan Tan
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*oi->typep < 0) {
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	enum object_type real_type;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
 	co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	/*
-	 * packed_object_info() does not follow the delta chain to
-	 * find out the real type, unless it is given oi->typep.
-	 */
-	if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-			oi->typep = NULL;
 		return sha1_object_info_extended(real, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 					 rtype == OBJ_OFS_DELTA);
 	}
-	if (oi->typename)
-		strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-		oi->typep = NULL;
 
 	return 0;
 }
-- 
2.13.1.518.g3df882009-goog



* [PATCH v3 2/4] sha1_file: move delta base cache code up
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (10 preceding siblings ...)
  2017-06-15 20:39 ` [PATCH v3 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-15 20:39 ` Jonathan Tan
  2017-06-15 20:39 ` [PATCH v3 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 220 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..a38319443 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct packed_git *p,
 	goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-		       struct object_info *oi)
-{
-	struct pack_window *w_curs = NULL;
-	unsigned long size;
-	off_t curpos = obj_offset;
-	enum object_type type;
-
-	/*
-	 * We always get the representation type, but only convert it to
-	 * a "real" type later if the caller is interested.
-	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-	if (oi->sizep) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			off_t tmp_pos = curpos;
-			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-							   type, obj_offset);
-			if (!base_offset) {
-				type = OBJ_BAD;
-				goto out;
-			}
-			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*oi->sizep == 0) {
-				type = OBJ_BAD;
-				goto out;
-			}
-		} else {
-			*oi->sizep = size;
-		}
-	}
-
-	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
-	}
-
-	if (oi->typep || oi->typename) {
-		enum object_type ptot;
-		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-					     curpos);
-		if (oi->typep)
-			*oi->typep = ptot;
-		if (oi->typename) {
-			const char *tn = typename(ptot);
-			if (tn)
-				strbuf_addstr(oi->typename, tn);
-		}
-		if (ptot < 0) {
-			type = OBJ_BAD;
-			goto out;
-		}
-	}
-
-	if (oi->delta_base_sha1) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			const unsigned char *base;
-
-			base = get_delta_base_sha1(p, &w_curs, curpos,
-						   type, obj_offset);
-			if (!base) {
-				type = OBJ_BAD;
-				goto out;
-			}
-
-			hashcpy(oi->delta_base_sha1, base);
-		} else
-			hashclr(oi->delta_base_sha1);
-	}
-
-out:
-	unuse_pack(&w_curs);
-	return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
-{
-	int st;
-	git_zstream stream;
-	unsigned char *buffer, *in;
-
-	buffer = xmallocz_gently(size);
-	if (!buffer)
-		return NULL;
-	memset(&stream, 0, sizeof(stream));
-	stream.next_out = buffer;
-	stream.avail_out = size + 1;
-
-	git_inflate_init(&stream);
-	do {
-		in = use_pack(p, w_curs, curpos, &stream.avail_in);
-		stream.next_in = in;
-		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
-			break; /* the payload is larger than it should be */
-		curpos += stream.next_in - in;
-	} while (st == Z_OK || st == Z_BUF_ERROR);
-	git_inflate_end(&stream);
-	if ((st != Z_STREAM_END) || stream.total_out != size) {
-		free(buffer);
-		return NULL;
-	}
-
-	return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
 	hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+		       struct object_info *oi)
+{
+	struct pack_window *w_curs = NULL;
+	unsigned long size;
+	off_t curpos = obj_offset;
+	enum object_type type;
+
+	/*
+	 * We always get the representation type, but only convert it to
+	 * a "real" type later if the caller is interested.
+	 */
+	type = unpack_object_header(p, &w_curs, &curpos, &size);
+
+	if (oi->sizep) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			off_t tmp_pos = curpos;
+			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
+							   type, obj_offset);
+			if (!base_offset) {
+				type = OBJ_BAD;
+				goto out;
+			}
+			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*oi->sizep == 0) {
+				type = OBJ_BAD;
+				goto out;
+			}
+		} else {
+			*oi->sizep = size;
+		}
+	}
+
+	if (oi->disk_sizep) {
+		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
+		*oi->disk_sizep = revidx[1].offset - obj_offset;
+	}
+
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+	}
+
+	if (oi->delta_base_sha1) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			const unsigned char *base;
+
+			base = get_delta_base_sha1(p, &w_curs, curpos,
+						   type, obj_offset);
+			if (!base) {
+				type = OBJ_BAD;
+				goto out;
+			}
+
+			hashcpy(oi->delta_base_sha1, base);
+		} else
+			hashclr(oi->delta_base_sha1);
+	}
+
+out:
+	unuse_pack(&w_curs);
+	return type;
+}
+
+static void *unpack_compressed_entry(struct packed_git *p,
+				    struct pack_window **w_curs,
+				    off_t curpos,
+				    unsigned long size)
+{
+	int st;
+	git_zstream stream;
+	unsigned char *buffer, *in;
+
+	buffer = xmallocz_gently(size);
+	if (!buffer)
+		return NULL;
+	memset(&stream, 0, sizeof(stream));
+	stream.next_out = buffer;
+	stream.avail_out = size + 1;
+
+	git_inflate_init(&stream);
+	do {
+		in = use_pack(p, w_curs, curpos, &stream.avail_in);
+		stream.next_in = in;
+		st = git_inflate(&stream, Z_FINISH);
+		if (!stream.avail_out)
+			break; /* the payload is larger than it should be */
+		curpos += stream.next_in - in;
+	} while (st == Z_OK || st == Z_BUF_ERROR);
+	git_inflate_end(&stream);
+	if ((st != Z_STREAM_END) || stream.total_out != size) {
+		free(buffer);
+		return NULL;
+	}
+
+	return buffer;
+}
+
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size);
 
-- 
2.13.1.518.g3df882009-goog



* [PATCH v3 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (11 preceding siblings ...)
  2017-06-15 20:39 ` [PATCH v3 2/4] sha1_file: move delta base cache code up Jonathan Tan
@ 2017-06-15 20:39 ` Jonathan Tan
  2017-06-15 20:39 ` [PATCH v3 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

In sha1_file.c, there are a few functions that provide information on an
object regardless of its storage (cached, loose, or packed). Looking
through all non-static functions in sha1_file.c that take in an unsigned
char * pointer, the relevant ones are:
 - sha1_object_info_extended
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (uses read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - has_sha1_file_with_flags
 - assert_sha1_type (auto-fixed by sha1_object_info)

Looking at the 3 primary functions (sha1_object_info_extended,
read_object, has_sha1_file_with_flags), they independently implement
mechanisms such as object replacement, retrying the packed store after
failing to find the object in the packed store then the loose store, and
being able to mark a packed object as bad and then retrying the whole
process. Consolidating these mechanisms would be a great help to
maintainability.

However, has_sha1_file_with_flags() does things that the other 2 don't
(skipping cached storage, allowing a "quick" mode that skips retrying
the packed storage after trying the loose storage, and refreshing any
loose files found).

Therefore, consolidate only the other 2 functions by extending
sha1_object_info_extended() to support the functionality needed, and
then modifying read_object() to use sha1_object_info_extended().
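
As a rough illustration of the pattern only (this is not code from this
patch, and every toy_* name below is hypothetical), an "info" function
whose outputs are all optional pointers can also hand back the content,
so that the "read" entry point shrinks to a thin wrapper:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these exist in Git. */
struct toy_object {
	const char *name;
	int type;
	const char *data;
};

struct toy_info {
	int *typep;           /* optional: receives the object type */
	unsigned long *sizep; /* optional: receives the object size */
	void **contentp;      /* optional: receives a malloc'd copy */
};

static const struct toy_object toy_objects[] = {
	{ "a", 3, "hello" },
	{ "b", 1, "tree-ish" },
};

/* Fill in only what the caller asked for; 0 if found, -1 otherwise. */
int toy_object_info(const char *name, struct toy_info *oi)
{
	size_t i;

	for (i = 0; i < sizeof(toy_objects) / sizeof(toy_objects[0]); i++) {
		const struct toy_object *o = &toy_objects[i];
		size_t len;

		if (strcmp(o->name, name))
			continue;
		len = strlen(o->data);
		if (oi->typep)
			*oi->typep = o->type;
		if (oi->sizep)
			*oi->sizep = len;
		if (oi->contentp) {
			char *copy = malloc(len + 1);
			if (!copy)
				return -1;
			memcpy(copy, o->data, len + 1);
			*oi->contentp = copy;
		}
		return 0;
	}
	return -1;
}

/* The reader no longer has its own lookup logic. */
void *toy_read_object(const char *name, int *type, unsigned long *size)
{
	struct toy_info oi = { 0 };
	void *content = NULL;

	oi.typep = type;
	oi.sizep = size;
	oi.contentp = &content;
	if (toy_object_info(name, &oi))
		return NULL;
	return content;
}
```

A caller that only wants the size leaves typep and contentp NULL; a
caller of toy_read_object() owns the returned buffer and must free() it.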

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |  1 +
 sha1_file.c | 84 ++++++++++++++++++++++++++++++-------------------------------
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e..63a73af17 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,7 @@ struct object_info {
 	off_t *disk_sizep;
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
+	void **contentp;
 
 	/* Response */
 	enum {
diff --git a/sha1_file.c b/sha1_file.c
index a38319443..60b487c70 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-	int ret;
-	git_zstream stream;
-	char hdr[8192];
-
-	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-		return NULL;
-
-	return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
 				  struct pack_window **w_curs,
 			          off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	if (!ent)
 		return unpack_entry(p, base_offset, type, base_size);
 
-	*type = ent->type;
-	*base_size = ent->size;
+	if (type)
+		*type = ent->type;
+	if (base_size)
+		*base_size = ent->size;
 	return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 * We always get the representation type, but only convert it to
 	 * a "real" type later if the caller is interested.
 	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (oi->contentp) {
+		*oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+						      &type);
+		if (!*oi->contentp)
+			type = OBJ_BAD;
+	} else {
+		type = unpack_object_header(p, &w_curs, &curpos, &size);
+	}
 
-	if (oi->sizep) {
+	if (!oi->contentp && oi->sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 		free(external_base);
 	}
 
-	*final_type = type;
-	*final_size = size;
+	if (final_type)
+		*final_type = type;
+	if (final_size)
+		*final_size = size;
 
 	unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	git_zstream stream;
 	char hdr[32];
 	struct strbuf hdrbuf = STRBUF_INIT;
+	unsigned long size_scratch;
 
 	if (oi->delta_base_sha1)
 		hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * return value implicitly indicates whether the
 	 * object even exists.
 	 */
-	if (!oi->typep && !oi->typename && !oi->sizep) {
+	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
 		if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
+
+	if (!oi->sizep)
+		oi->sizep = &size_scratch;
+
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
 	if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
@@ -2956,10 +2959,18 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 				       sha1_to_hex(sha1));
 	} else if ((status = parse_sha1_header_extended(hdr, oi, flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
-	git_inflate_end(&stream);
+
+	if (status >= 0 && oi->contentp)
+		*oi->contentp = unpack_sha1_rest(&stream, hdr,
+						 *oi->sizep, sha1);
+	else
+		git_inflate_end(&stream);
+
 	munmap(map, mapsize);
 	if (status && oi->typep)
 		*oi->typep = status;
+	if (oi->sizep == &size_scratch)
+		oi->sizep = NULL;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
@@ -2983,6 +2994,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			hashclr(oi->delta_base_sha1);
 		if (oi->typename)
 			strbuf_addstr(oi->typename, typename(co->type));
+		if (oi->contentp)
+			*oi->contentp = xmemdupz(co->buf, co->size);
 		oi->whence = OI_CACHED;
 		return 0;
 	}
@@ -3075,28 +3088,15 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size)
 {
-	unsigned long mapsize;
-	void *map, *buf;
-	struct cached_object *co;
-
-	co = find_cached_object(sha1);
-	if (co) {
-		*type = co->type;
-		*size = co->size;
-		return xmemdupz(co->buf, co->size);
-	}
+	struct object_info oi = OBJECT_INFO_INIT;
+	void *content;
+	oi.typep = type;
+	oi.sizep = size;
+	oi.contentp = &content;
 
-	buf = read_packed_sha1(sha1, type, size);
-	if (buf)
-		return buf;
-	map = map_sha1_file(sha1, &mapsize);
-	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
-		munmap(map, mapsize);
-		return buf;
-	}
-	reprepare_packed_git();
-	return read_packed_sha1(sha1, type, size);
+	if (sha1_object_info_extended(sha1, &oi, 0))
+		return NULL;
+	return content;
 }
 
 /*
-- 
2.13.1.518.g3df882009-goog



* [PATCH v3 4/4] sha1_file, fsck: add missing blob support
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (12 preceding siblings ...)
  2017-06-15 20:39 ` [PATCH v3 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
@ 2017-06-15 20:39 ` Jonathan Tan
  2017-06-20  1:03 ` [PATCH v4 0/8] Improvements to sha1_file Jonathan Tan
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-15 20:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Currently, Git does not handle two kinds of repos very well: repos with
very large numbers of blobs, and repos that wish to minimize
manipulation of certain blobs (for example, because they are very
large). This is so even if the user operates mostly on part of the repo,
because Git is designed on the assumption that every blob referenced by
a tree object is available somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested and unavailable but registered to be missing, and by updating
fsck to tolerate such blobs.  The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.
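
For illustration only: the manifest format is not documented in this
patch, so the layout assumed below is inferred from
in_missing_blob_manifest() and the test script, namely an 8-byte
network-order entry count followed by 28-byte entries (20-byte raw SHA-1
plus 8-byte network-order size), assumed to be sorted by hash. The
sketch uses a plain binary search instead of sha1_entry_pos(), and its
function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20              /* raw SHA-1 length */
#define ENTRY_LEN (HASH_LEN + 8) /* hash followed by a 64-bit size */

/* Read a 64-bit big-endian (network byte order) integer. */
uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Binary-search one in-memory manifest for "hash". Returns 1 if it is
 * listed (storing its recorded size in *size when size is non-NULL),
 * 0 otherwise. Assumes the entries are sorted by hash.
 */
int manifest_lookup(const unsigned char *data, const unsigned char *hash,
		    uint64_t *size)
{
	uint64_t lo = 0, hi = read_be64(data);
	const unsigned char *entries = data + 8;

	while (lo < hi) {
		uint64_t mid = lo + (hi - lo) / 2;
		const unsigned char *e = entries + mid * ENTRY_LEN;
		int cmp = memcmp(hash, e, HASH_LEN);

		if (!cmp) {
			if (size)
				*size = read_be64(e + HASH_LEN);
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

Decoding the integers byte by byte keeps the sketch independent of the
host's endianness and of any ntohll() helper.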

This commit does not include support for generating such a repo; neither
has any command (other than fsck) been modified to either tolerate
missing blobs (without invoking the hook) or be more efficient in
invoking the missing blob hook. Only a fallback is provided in the form
of sha1_file invoking the missing blob hook when necessary.
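
The fallback amounts to "look up, and on a miss run the hook at most
once, then look up again". A toy sketch of that control flow follows;
the struct and function names are hypothetical, and unlike the real
code it returns 0 rather than dying when the hook fails:

```c
#include <assert.h>

/* Hypothetical toy state standing in for the object store. */
struct toy_repo {
	int present;    /* object exists in local storage */
	int registered; /* object is listed as known-missing */
	int hook_calls; /* number of times the hook has run */
	int hook_works; /* whether the hook actually delivers the object */
};

/* Stand-in for running the configured command. */
void toy_run_hook(struct toy_repo *r)
{
	r->hook_calls++;
	if (r->hook_works)
		r->present = 1;
}

/* Returns 1 if the object is available, running the hook at most once. */
int toy_has_object(struct toy_repo *r)
{
	int already_retried = 0;

retry:
	if (r->present)
		return 1;
	if (!already_retried && r->registered) {
		already_retried = 1;
		toy_run_hook(r);
		goto retry;
	}
	return 0;
}
```

The already_retried guard is what prevents an endless loop when the
hook exits successfully but does not actually produce the object.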

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the caller
     needing to know how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended() and
has_sha1_file_with_flags().

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.

An alternative design that I considered but rejected:

 - Invoking the hook only when a packed blob is requested, not on any
   blob. That is, whenever we search the packfiles for a blob and it is
   missing (from the packfiles and from the loose object storage), we
   would invoke the hook (which must then store it as a packfile), open
   the packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large number of blobs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt |  10 +++
 builtin/fsck.c           |   7 ++
 cache.h                  |   7 ++
 sha1_file.c              | 171 +++++++++++++++++++++++++++++++++++++++++++----
 t/t3907-missing-blob.sh  |  69 +++++++++++++++++++
 5 files changed, 250 insertions(+), 14 deletions(-)
 create mode 100755 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dd4beec39..10da5fde1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -390,6 +390,16 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+	If set, whenever a blob in the local repo is attempted to be
+	read but is missing, invoke this shell command to generate or
+	obtain that blob before reporting an error. This shell command
+	should take one or more hashes, each terminated by a newline, as
+	standard input, and (if successful) should write the
+	corresponding objects to the local repo (packed or loose).
++
+If set, fsck will not treat a missing blob as an error condition.
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/builtin/fsck.c b/builtin/fsck.c
index cb2ba6cd1..b447bd5f9 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -37,6 +37,7 @@ static int verbose;
 static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
+static int missing_blob_ok;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -93,6 +94,9 @@ static int fsck_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.missingblobcommand"))
+		missing_blob_ok = 1;
+
 	return git_default_config(var, value, cb);
 }
 
@@ -222,6 +226,9 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
+		if (missing_blob_ok && obj->type == OBJ_BLOB &&
+		    in_missing_blob_manifest(obj->oid.hash, NULL))
+			return;
 		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
diff --git a/cache.h b/cache.h
index 63a73af17..dd69c75f5 100644
--- a/cache.h
+++ b/cache.h
@@ -1870,6 +1870,13 @@ struct object_info {
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
+/*
+ * Returns 1 if sha1 is the hash of a known missing blob. If size is not NULL,
+ * also returns its size.
+ */
+extern int in_missing_blob_manifest(const unsigned char *sha1,
+				    unsigned long *size);
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/sha1_file.c b/sha1_file.c
index 60b487c70..7ef239907 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -27,6 +27,9 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -1624,6 +1627,72 @@ static const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 	return NULL;
 }
 
+struct missing_blob_manifest {
+	struct missing_blob_manifest *next;
+	const char *data;
+};
+static struct missing_blob_manifest *missing_blobs;
+static int missing_blobs_initialized;
+
+static void prepare_missing_blobs(void)
+{
+	int ok;
+	char *dirname;
+	struct dir_iterator *iter;
+
+	if (missing_blobs_initialized)
+		return;
+
+	missing_blobs_initialized = 1;
+
+	dirname = xstrfmt("%s/missing", get_object_directory());
+	iter = dir_iterator_begin(dirname);
+
+	while ((ok = dir_iterator_advance(iter)) == ITER_OK) {
+		int fd;
+		const char *data;
+		struct missing_blob_manifest *m;
+		if (!S_ISREG(iter->st.st_mode))
+			continue;
+		fd = git_open(iter->path.buf);
+		data = xmmap(NULL, iter->st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		close(fd);
+
+		m = xmalloc(sizeof(*m));
+		m->next = missing_blobs;
+		m->data = data;
+		missing_blobs = m;
+	}
+
+	if (ok != ITER_DONE) {
+		/* do something */
+	}
+
+	free(dirname);
+}
+
+int in_missing_blob_manifest(const unsigned char *sha1, unsigned long *size)
+{
+	struct missing_blob_manifest *m;
+	prepare_missing_blobs();
+	for (m = missing_blobs; m; m = m->next) {
+		uint64_t nr_nbo, nr;
+		int result;
+		memcpy(&nr_nbo, m->data, sizeof(nr_nbo));
+		nr = ntohll(nr_nbo);
+		result = sha1_entry_pos(m->data, GIT_SHA1_RAWSZ + 8, 8, 0, nr, nr, sha1);
+		if (result >= 0) {
+			if (size) {
+				uint64_t size_nbo;
+				memcpy(&size_nbo, m->data + 8 + result * (GIT_SHA1_RAWSZ + 8) + GIT_SHA1_RAWSZ, sizeof(size_nbo));
+				*size = ntohll(size_nbo);
+			}
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /*
  * With an in-core object data in "map", rehash it to make sure the
  * object name actually matches "sha1" to detect object corruption.
@@ -2975,12 +3044,57 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	return (status < 0) ? status : 0;
 }
 
+static char *missing_blob_command;
+static int missing_blob_config(const char *conf_key, const char *value,
+			       void *cb)
+{
+	if (!strcmp(conf_key, "core.missingblobcommand")) {
+		missing_blob_command = xstrdup(value);
+	}
+	return 0;
+}
+
+static void ensure_missing_blob_configured(void)
+{
+	static int configured;
+	if (configured)
+		return;
+
+	git_config(missing_blob_config, NULL);
+	configured = 1;
+}
+
+static void handle_missing_blob(const unsigned char *sha1)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	const char *argv[] = {missing_blob_command, NULL};
+	char input[GIT_MAX_HEXSZ + 1];
+
+	memcpy(input, sha1_to_hex(sha1), 40);
+	input[40] = '\n';
+
+	cp.argv = argv;
+	cp.env = local_repo_env;
+	cp.use_shell = 1;
+
+	if (pipe_command(&cp, input, sizeof(input), NULL, 0, NULL, 0)) {
+		die("failed to load blob %s", sha1_to_hex(sha1));
+	}
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+}
+
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	int already_retried = 0;
 
 	co = find_cached_object(real);
 	if (co) {
@@ -3000,19 +3114,35 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		return 0;
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
-			oi->whence = OI_LOOSE;
-			return 0;
-		}
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		reprepare_packed_git();
-		if (!find_pack_entry(real, &e))
-			return -1;
+	/* Most likely it's a loose object. */
+	if (!sha1_loose_object_info(real, oi, flags)) {
+		oi->whence = OI_LOOSE;
+		return 0;
+	}
+
+	/* Not a loose object; someone else may have just packed it. */
+	reprepare_packed_git();
+	if (find_pack_entry(real, &e))
+		goto found_packed;
+
+	/* Try the missing blobs */
+	if (!already_retried) {
+		ensure_missing_blob_configured();
+		if (missing_blob_command &&
+		    in_missing_blob_manifest(real, NULL)) {
+			already_retried = 1;
+			handle_missing_blob(real);
+			goto retry;
+		}
 	}
 
+	return -1;
+
+found_packed:
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
@@ -3475,17 +3605,30 @@ int has_sha1_pack(const unsigned char *sha1)
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
 	struct pack_entry e;
+	int already_retried = 0;
 
 	if (!startup_info->have_repository)
 		return 0;
+retry:
 	if (find_pack_entry(sha1, &e))
 		return 1;
 	if (has_loose_object(sha1))
 		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	if (!(flags & HAS_SHA1_QUICK)) {
+		reprepare_packed_git();
+		if (find_pack_entry(sha1, &e))
+			return 1;
+	}
+	if (!already_retried) {
+		ensure_missing_blob_configured();
+		if (missing_blob_command &&
+		    in_missing_blob_manifest(sha1, NULL)) {
+			already_retried = 1;
+			handle_missing_blob(sha1);
+			goto retry;
+		}
+	}
+	return 0;
 }
 
 int has_object_file(const struct object_id *oid)
diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
new file mode 100755
index 000000000..7962414cb
--- /dev/null
+++ b/t/t3907-missing-blob.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='core.missingblobcommand option'
+
+. ./test-lib.sh
+
+pack () {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
+
+test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+
+	# does not work if missing blob is not registered
+	test_must_fail git -C client cat-file -p "$HASH" &&
+
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+
+	# works when missing blob is registered
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'has_sha1_file (through git cat-file -e)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	git -C client cat-file -e "$HASH"
+'
+
+test_expect_success 'fsck' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+	echo hash is $HASH &&
+
+	cp -r server client &&
+	test_config -C client core.missingblobcommand "this-command-is-not-actually-run" &&
+	mkdir -p client/.git/objects/missing &&
+	printf "%016x%s%016x" 1 "$HASH" "$(wc -c <server/1.t)" |
+		pack >client/.git/objects/missing/x &&
+	rm client/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&
+	git -C client fsck
+'
+
+test_done
-- 
2.13.1.518.g3df882009-goog



* Re: [PATCH v2 4/4] sha1_file, fsck: add missing blob support
  2017-06-15 20:31     ` Jonathan Tan
@ 2017-06-15 20:52       ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-15 20:52 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> There is indeed no reason why we need to keep multiple ones separate for
> an extended period of time - my thinking was to let fetch/clone be fast
> by not needing to scan through the entire existing manifest (in order to
> create the new one),  letting GC take care of consolidating them ...

Given that fetch/clone already incur network cost and the users
expect to wait for them to finish, I wouldn't have made such a
trade-off.

>> > +int has_missing_blob(const unsigned char *sha1, unsigned long *size)
>> > +{
>> 
>> This function that answers "is it expected to be missing?" is
>> confusingly named.  Is it missing, or does it exist?
>
> Renamed to in_missing_blob_manifest().

Either that, or "is_known_to_be_missing()", would be OK.

Thanks.


* Re: [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-15 17:50   ` Junio C Hamano
  2017-06-15 18:14     ` Jonathan Tan
@ 2017-06-17 12:19     ` Jeff King
  2017-06-19  4:18       ` Junio C Hamano
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff King @ 2017-06-17 12:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Thu, Jun 15, 2017 at 10:50:46AM -0700, Junio C Hamano wrote:

> >  	if (!find_pack_entry(real, &e)) {
> >  		/* Most likely it's a loose object. */
> > -		if (!sha1_loose_object_info(real, oi, flags)) {
> > +		if (oi && !sha1_loose_object_info(real, oi, flags)) {
> >  			oi->whence = OI_LOOSE;
> >  			return 0;
> >  		}
> > +		if (!oi && has_loose_object(real))
> > +			return 0;
> 
> This conversion is not incorrect per-se.  
> 
> We can see that has_sha1_file_with_flags() after this patch still
> calls has_loose_object().  But it bothers me that there is no hint
> to future developers to warn that a rewrite of the above like this
> is incorrect:
> 
>         if (!find_pack_entry(real, &e)) {
>                 /* Most likely it's a loose object. */
>        +        struct object_info dummy_oi;
>        +        if (!oi) {
>        +                memset(&dummy_oi, 0, sizeof(dummy_oi));
>        +                oi = &dummy_oi;
>        +        }
>        -        if (oi && !sha1_loose_object_info(real, oi, flags)) {
>        +        if (!sha1_loose_object_info(real, oi, flags)) {
>                         oi->whence = OI_LOOSE;
>                         return 0;
>                 }
>        -        if (!oi && has_loose_object(real))
>        -                return 0;
> 
> It used to be very easy to see that has_sha1_file_with_flags() will
> call has_loose_object() when it does not find the object in a pack,
> which will result in the loose object file freshened.  In the new
> code, it is very subtle to see that---it will happen when the caller
> passes oi == NULL, and has_sha1_file_with_flags() is such a caller,
> but it is unclear if there are other callers of this "consolidated"
> sha1_object_info_extended() that passes oi == NULL, and if they do
> also want to freshen the loose object file when they do so.

I also found this quite subtle. However, I don't think that
has_sha1_file() actually freshens. It's a bit confusing because
has_loose_object() reuses the check_and_freshen() function to do the
lookup, but it actually sets the "freshen" flag to false.

That's why in 33d4221c7 (write_sha1_file: freshen existing objects,
2014-10-15), which introduced the freshening functions and converted
has_loose_object(), the actual write_sha1_file() function switched to
using the freshening functions directly (and obviously sets the freshen
parameter to true).
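
To make the distinction concrete, here is a minimal standalone sketch of
the "check, and optionally freshen" shape described above. It is not the
code in sha1_file.c: the function names are hypothetical, it works on a
plain path instead of an object name, and it uses POSIX access() and
utime() directly.

```c
#include <stdio.h>
#include <unistd.h>
#include <utime.h>

/*
 * Hypothetical stand-in for a check-and-freshen helper.
 * Returns 1 if the file exists (and, when asked, its mtime was
 * updated), 0 otherwise.
 */
static int check_and_freshen_path(const char *path, int freshen)
{
	if (access(path, F_OK))
		return 0;	/* the file is not there */
	if (freshen && utime(path, NULL))
		return 0;	/* it exists, but its mtime could not be updated */
	return 1;
}

/* Pure existence check: reuses the helper with freshen == 0. */
static int has_path(const char *path)
{
	return check_and_freshen_path(path, 0);
}

/* What a writer that wants to keep the file alive would call. */
static int freshen_path(const char *path)
{
	return check_and_freshen_path(path, 1);
}
```

The point of the sketch is that sharing the helper does not imply
sharing the side effect: only the caller that passes a true "freshen"
argument touches the mtime.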

I actually think all of that infrastructure could become part of
Jonathan's consolidated lookup, too. We would just need:

  1. A QUICK flag to avoid re-reading objects/pack when we don't find
     anything (which it looks like he already has).

  2. A FRESHEN flag to update the mtime of any item that we do find.

I suspect we may also need something like ONLY_LOOSE and ONLY_NONLOCAL
to meet all the callers (e.g., has_loose_object_nonlocal). Those should
be easy to implement, I'd think.

> I would have preferred to see the new variable not to be called 'f',
> as that makes it unclear what it is (is that a callback function
> pointer?).  In this case, you are forcing the flag bits passed
> down, so perhaps you can reuse the same variable?
> 
> If you allocated a separate variable because
> has_sha1_file_with_flags() and sha1_object_info_extended() take flag
> bits from two separate vocabularies, that is valid reasoning.  But in
> that case I would have named 'f' to reflect the fact that it is not
> the parameter 'flag' defined in the has_sha1_file_with_flags() world
> but a different thing defined in the sha1_object_info_extended()
> world, e.g. "soie_flag" or something like that.

I had the same thoughts (both on the name and the "vocabularies"). IMHO
we should consider allocating the bits from the same set. There's only
one HAS_SHA1 flag, and it has an exact match in OBJECT_INFO_QUICK.
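
As a rough illustration of what "the same set" could look like, the
sketch below aliases the old name to the new bit so that a wrapper can
pass its flags straight through without translating them. The values 1
and 2 are the ones used for the OBJECT_INFO_* flags in this series; the
value 8 for OBJECT_INFO_QUICK and both function names are assumptions
made up for the example.

```c
#define OBJECT_INFO_LOOKUP_REPLACE 1
#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
#define OBJECT_INFO_QUICK 8	/* assumed value, for illustration only */

/* One vocabulary: the legacy name is an alias, not a separate bit set. */
#define HAS_SHA1_QUICK OBJECT_INFO_QUICK

/*
 * Hypothetical stand-in for the consolidated lookup. It only records
 * the flags it was given and pretends the object was found.
 */
static unsigned seen_flags;

static int object_info_stub(unsigned flags)
{
	seen_flags = flags;
	return 0;
}

/* The wrapper needs no 'f' variable: its flags are already valid below. */
static int has_object_with_flags_sketch(unsigned flags)
{
	return !object_info_stub(flags);
}
```

With a shared set of bits the naming question about 'f' goes away,
because there is only one flags variable to name.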

I think this patch might be a bit easier to review if it were broken
down more in a sequence of:

  1. Add features to the consolidated function to support everything
     that function X supports.

  2. Preparatory cleanup around X (e.g., pointing HAS_SHA1_QUICK at
     OBJECT_INFO_QUICK).

  3. Convert X to use the consolidated function.

  4. Repeat for each X we wish to consolidate.

That's going to end up with probably 12 patches instead of one, but I
think it may be a lot easier to communicate the reason for the various
design decisions.

-Peff

* Re: [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns
  2017-06-17 12:19     ` Jeff King
@ 2017-06-19  4:18       ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-19  4:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> I actually think all of that infrastructure could become part of
> Jonathan's consolidated lookup, too. We would just need:
>
>   1. A QUICK flag to avoid re-reading objects/pack when we don't find
>      anything (which it looks like he already has).
>
>   2. A FRESHEN flag to update the mtime of any item that we do find.
>
> I suspect we may also need something like ONLY_LOOSE and ONLY_NONLOCAL
> to meet all the callers (e.g., has_loose_object_nonlocal). Those should
> be easy to implement, I'd think.

Ahh, that makes a lot more sense than my reading of the related
codepath.  Thanks for straightening me out.

> ...
> I think this patch might be a bit easier to review if it were broken
> down more in a sequence of:
>
>   1. Add features to the consolidated function to support everything
>      that function X supports.
>
>   2. Preparatory cleanup around X (e.g., pointing HAS_SHA1_QUICK at
>      OBJECT_INFO_QUICK).
>
>   3. Convert X to use the consolidated function.
>
>   4. Repeat for each X we wish to consolidate.
>
> That's going to end up with probably 12 patches instead of one, but I
> think it may be a lot easier to communicate the reason for the various
> design decisions.

True, too.  Thanks.

* [PATCH v4 0/8] Improvements to sha1_file
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (13 preceding siblings ...)
  2017-06-15 20:39 ` [PATCH v3 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-21 18:18   ` Junio C Hamano
  2017-06-24 12:51   ` Jeff King
  2017-06-20  1:03 ` [PATCH v4 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
                   ` (16 subsequent siblings)
  31 siblings, 2 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

Thanks, Peff and Junio, for your comments. Here's an updated version and
some replies to comments.

> I also found this quite subtle. However, I don't think that
> has_sha1_file() actually freshens. It's a bit confusing because
> has_loose_object() reuses the check_and_freshen() function to do the
> lookup, but it actually sets the "freshen" flag to false.
> 
> That's why in 33d4221c7 (write_sha1_file: freshen existing objects,
> 2014-10-15), which introduced the freshening functions and converted
> has_loose_object(), the actual write_sha1_file() function switched to
> using the freshening functions directly (and obviously sets the freshen
> parameter to true).

Good catch.

> I actually think all of that infrastructure could become part of
> Jonathan's consolidated lookup, too. We would just need:
> 
>   1. A QUICK flag to avoid re-reading objects/pack when we don't find
>      anything (which it looks like he already has).
> 
>   2. A FRESHEN flag to update the mtime of any item that we do find.
> 
> I suspect we may also need something like ONLY_LOOSE and ONLY_NONLOCAL
> to meet all the callers (e.g., has_loose_object_nonlocal). Those should
> be easy to implement, I'd think.

For things like FRESHEN, ONLY_LOOSE, and ONLY_NONLOCAL, I was thinking
that I would like to restrict these patches to only handle the cases
that are agnostic to the type of storage (in preparation for missing
blob handling patches).

> I had the same thoughts (both on the name and the "vocabularies"). IMHO
> we should consider allocating the bits from the same set. There's only
> one HAS_SHA1 flag, and it has an exact match in OBJECT_INFO_QUICK.

Agreed - in this patch set, I have also consolidated the relevant flags,
including LOOKUP_REPLACE_OBJECT and LOOKUP_UNKNOWN_OBJECT.

In addition, Junio has mentioned the potential confusion in behavior
between a NULL and an empty struct passed to
sha1_object_info_extended(). In this patch set, I require non-NULL, and
have added an optimization that avoids accessing the pack in certain
situations, but this optimization requires checking a lot of fields. Let
me know what you think.
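
To show why the non-NULL requirement leads to "checking a lot of
fields", here is a simplified sketch. The struct is a cut-down,
hypothetical version of struct object_info (the "typename" strbuf is
left out and the types are simplified), and requests_nothing() is a name
invented for this example.

```c
#include <stddef.h>

/* Cut-down stand-in for struct object_info: request fields only. */
struct object_info_sketch {
	int *typep;
	unsigned long *sizep;
	long *disk_sizep;
	unsigned char *delta_base_sha1;
	void **contentp;
};

/*
 * With a non-NULL struct, "the caller only wants to know whether the
 * object exists" can no longer be spelled oi == NULL. It has to be
 * derived by checking that every request field is unset.
 */
static int requests_nothing(const struct object_info_sketch *oi)
{
	return !oi->typep && !oi->sizep && !oi->disk_sizep &&
	       !oi->delta_base_sha1 && !oi->contentp;
}
```

A caller that previously passed NULL would pass a zero-initialized
struct instead, and the lookup could skip reading the pack entry when
requests_nothing() is true. Every request field added later has to be
added to this check as well, which is the maintenance cost mentioned
above.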

Jonathan Tan (8):
  sha1_file: teach packed_object_info about typename
  sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  sha1_file: rename LOOKUP_REPLACE_OBJECT
  sha1_file: move delta base cache code up
  sha1_file: refactor read_object
  sha1_file: improve sha1_object_info_extended
  sha1_file: do not access pack if unneeded
  sha1_file: refactor has_sha1_file_with_flags

 builtin/cat-file.c   |   7 +-
 builtin/fetch.c      |  10 +-
 builtin/index-pack.c |   3 +-
 cache.h              |  37 +++--
 sha1_file.c          | 391 ++++++++++++++++++++++++++-------------------------
 streaming.c          |   1 +
 6 files changed, 228 insertions(+), 221 deletions(-)

-- 
2.13.1.611.g7e3b11ae1-goog


* [PATCH v4 1/8] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (14 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 0/8] Improvements to sha1_file Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-20  1:03 ` [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*oi->typep < 0) {
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	enum object_type real_type;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
 	co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	/*
-	 * packed_object_info() does not follow the delta chain to
-	 * find out the real type, unless it is given oi->typep.
-	 */
-	if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-			oi->typep = NULL;
 		return sha1_object_info_extended(real, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 					 rtype == OBJ_OFS_DELTA);
 	}
-	if (oi->typename)
-		strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-		oi->typep = NULL;
 
 	return 0;
 }
-- 
2.13.1.611.g7e3b11ae1-goog


* [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (15 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-21 17:22   ` Junio C Hamano
  2017-06-20  1:03 ` [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
                   ` (14 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
("sha1_file: support reading from a loose object of unknown type",
2015-05-03) in order to support a feature in cat-file subsequently
introduced in commit 39e4ae3 ("cat-file: teach cat-file a
'--allow-unknown-type' option", 2015-05-03). Despite its name and
location in cache.h, this flag is used neither in
read_sha1_file_extended() nor in any of the lookup functions, but used
only in sha1_object_info_extended().

Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
name of the cat-file flag that invokes this feature, and move it closer
to the declaration of sha1_object_info_extended(). Also add
documentation for this flag.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c | 2 +-
 cache.h            | 3 ++-
 sha1_file.c        | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..209374b3c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -60,7 +60,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	const char *path = force_path;
 
 	if (unknown_type)
-		flags |= LOOKUP_UNKNOWN_OBJECT;
+		flags |= OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
 
 	if (get_sha1_with_context(obj_name, GET_SHA1_RECORD_PATH,
 				  oid.hash, &obj_context))
diff --git a/cache.h b/cache.h
index 4d92aae0e..e2ec45dfe 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
 
 /* object replacement */
 #define LOOKUP_REPLACE_OBJECT 1
-#define LOOKUP_UNKNOWN_OBJECT 2
 extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
 {
@@ -1866,6 +1865,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Allow reading from a loose object file of unknown/bogus type */
+#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..ad04ea8e0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1964,7 +1964,7 @@ static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
 	 * we're obtaining the type using '--allow-unknown-type'
 	 * option.
 	 */
-	if ((flags & LOOKUP_UNKNOWN_OBJECT) && (type < 0))
+	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE) && (type < 0))
 		type = 0;
 	else if (type < 0)
 		die("invalid object type");
@@ -2941,7 +2941,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 		return -1;
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
-	if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
+	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
 		if (unpack_sha1_header_to_strbuf(&stream, map, mapsize, hdr, sizeof(hdr), &hdrbuf) < 0)
 			status = error("unable to unpack %s header with --allow-unknown-type",
 				       sha1_to_hex(sha1));
-- 
2.13.1.611.g7e3b11ae1-goog


* [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (16 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-21 17:33   ` Junio C Hamano
  2017-06-20  1:03 ` [PATCH v4 4/8] sha1_file: move delta base cache code up Jonathan Tan
                   ` (13 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

The LOOKUP_REPLACE_OBJECT flag controls whether the
lookup_replace_object() function is invoked by
sha1_object_info_extended(), read_sha1_file_extended(), and
lookup_replace_object_extended(), but it is not immediately clear which
functions accept that flag.

Therefore restrict this flag to only sha1_object_info_extended(),
renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
documentation. Update read_sha1_file_extended() to have a boolean
parameter instead, and delete lookup_replace_object_extended().

parse_sha1_header() also passes this flag to
parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
reading from a loose object of unknown type", 2015-05-03), but that has
had no effect since that commit. Therefore this patch also removes this
flag from that invocation.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c |  5 +++--
 cache.h            | 17 ++++++-----------
 sha1_file.c        | 13 ++++++++-----
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 209374b3c..923786c00 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -56,7 +56,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	struct object_context obj_context;
 	struct object_info oi = OBJECT_INFO_INIT;
 	struct strbuf sb = STRBUF_INIT;
-	unsigned flags = LOOKUP_REPLACE_OBJECT;
+	unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
 	const char *path = force_path;
 
 	if (unknown_type)
@@ -337,7 +337,8 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 	struct strbuf buf = STRBUF_INIT;
 
 	if (!data->skip_object_info &&
-	    sha1_object_info_extended(data->oid.hash, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
+	    sha1_object_info_extended(data->oid.hash, &data->info,
+				      OBJECT_INFO_LOOKUP_REPLACE)) {
 		printf("%s missing\n",
 		       obj_name ? obj_name : oid_to_hex(&data->oid));
 		fflush(stdout);
diff --git a/cache.h b/cache.h
index e2ec45dfe..a3631b237 100644
--- a/cache.h
+++ b/cache.h
@@ -1205,12 +1205,12 @@ extern char *xdg_config_home(const char *filename);
  */
 extern char *xdg_cache_home(const char *filename);
 
-/* object replacement */
-#define LOOKUP_REPLACE_OBJECT 1
-extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
+extern void *read_sha1_file_extended(const unsigned char *sha1,
+				     enum object_type *type,
+				     unsigned long *size, int lookup_replace);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
 {
-	return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
+	return read_sha1_file_extended(sha1, type, size, 1);
 }
 
 /*
@@ -1232,13 +1232,6 @@ static inline const unsigned char *lookup_replace_object(const unsigned char *sh
 	return do_lookup_replace_object(sha1);
 }
 
-static inline const unsigned char *lookup_replace_object_extended(const unsigned char *sha1, unsigned flag)
-{
-	if (!(flag & LOOKUP_REPLACE_OBJECT))
-		return sha1;
-	return lookup_replace_object(sha1);
-}
-
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
 extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
@@ -1865,6 +1858,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Invoke lookup_replace_object() on the given hash */
+#define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
diff --git a/sha1_file.c b/sha1_file.c
index ad04ea8e0..ae44b32f3 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2002,7 +2002,7 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	struct object_info oi = OBJECT_INFO_INIT;
 
 	oi.sizep = sizep;
-	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
+	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
@@ -2969,7 +2969,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
+				    lookup_replace_object(sha1) :
+				    sha1;
 
 	co = find_cached_object(real);
 	if (co) {
@@ -3025,7 +3027,7 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 
 	oi.typep = &type;
 	oi.sizep = sizep;
-	if (sha1_object_info_extended(sha1, &oi, LOOKUP_REPLACE_OBJECT) < 0)
+	if (sha1_object_info_extended(sha1, &oi, OBJECT_INFO_LOOKUP_REPLACE))
 		return -1;
 	return type;
 }
@@ -3107,13 +3109,14 @@ static void *read_object(const unsigned char *sha1, enum object_type *type,
 void *read_sha1_file_extended(const unsigned char *sha1,
 			      enum object_type *type,
 			      unsigned long *size,
-			      unsigned flag)
+			      int lookup_replace)
 {
 	void *data;
 	const struct packed_git *p;
 	const char *path;
 	struct stat st;
-	const unsigned char *repl = lookup_replace_object_extended(sha1, flag);
+	const unsigned char *repl = lookup_replace ? lookup_replace_object(sha1)
+						   : sha1;
 
 	errno = 0;
 	data = read_object(repl, type, size);
-- 
2.13.1.611.g7e3b11ae1-goog


* [PATCH v4 4/8] sha1_file: move delta base cache code up
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (17 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-20  1:03 ` [PATCH v4 5/8] sha1_file: refactor read_object Jonathan Tan
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 220 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index ae44b32f3..a7be45efe 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct packed_git *p,
 	goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-		       struct object_info *oi)
-{
-	struct pack_window *w_curs = NULL;
-	unsigned long size;
-	off_t curpos = obj_offset;
-	enum object_type type;
-
-	/*
-	 * We always get the representation type, but only convert it to
-	 * a "real" type later if the caller is interested.
-	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-	if (oi->sizep) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			off_t tmp_pos = curpos;
-			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-							   type, obj_offset);
-			if (!base_offset) {
-				type = OBJ_BAD;
-				goto out;
-			}
-			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*oi->sizep == 0) {
-				type = OBJ_BAD;
-				goto out;
-			}
-		} else {
-			*oi->sizep = size;
-		}
-	}
-
-	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
-	}
-
-	if (oi->typep || oi->typename) {
-		enum object_type ptot;
-		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-					     curpos);
-		if (oi->typep)
-			*oi->typep = ptot;
-		if (oi->typename) {
-			const char *tn = typename(ptot);
-			if (tn)
-				strbuf_addstr(oi->typename, tn);
-		}
-		if (ptot < 0) {
-			type = OBJ_BAD;
-			goto out;
-		}
-	}
-
-	if (oi->delta_base_sha1) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			const unsigned char *base;
-
-			base = get_delta_base_sha1(p, &w_curs, curpos,
-						   type, obj_offset);
-			if (!base) {
-				type = OBJ_BAD;
-				goto out;
-			}
-
-			hashcpy(oi->delta_base_sha1, base);
-		} else
-			hashclr(oi->delta_base_sha1);
-	}
-
-out:
-	unuse_pack(&w_curs);
-	return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
-{
-	int st;
-	git_zstream stream;
-	unsigned char *buffer, *in;
-
-	buffer = xmallocz_gently(size);
-	if (!buffer)
-		return NULL;
-	memset(&stream, 0, sizeof(stream));
-	stream.next_out = buffer;
-	stream.avail_out = size + 1;
-
-	git_inflate_init(&stream);
-	do {
-		in = use_pack(p, w_curs, curpos, &stream.avail_in);
-		stream.next_in = in;
-		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
-			break; /* the payload is larger than it should be */
-		curpos += stream.next_in - in;
-	} while (st == Z_OK || st == Z_BUF_ERROR);
-	git_inflate_end(&stream);
-	if ((st != Z_STREAM_END) || stream.total_out != size) {
-		free(buffer);
-		return NULL;
-	}
-
-	return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
 	hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+		       struct object_info *oi)
+{
+	struct pack_window *w_curs = NULL;
+	unsigned long size;
+	off_t curpos = obj_offset;
+	enum object_type type;
+
+	/*
+	 * We always get the representation type, but only convert it to
+	 * a "real" type later if the caller is interested.
+	 */
+	type = unpack_object_header(p, &w_curs, &curpos, &size);
+
+	if (oi->sizep) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			off_t tmp_pos = curpos;
+			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
+							   type, obj_offset);
+			if (!base_offset) {
+				type = OBJ_BAD;
+				goto out;
+			}
+			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*oi->sizep == 0) {
+				type = OBJ_BAD;
+				goto out;
+			}
+		} else {
+			*oi->sizep = size;
+		}
+	}
+
+	if (oi->disk_sizep) {
+		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
+		*oi->disk_sizep = revidx[1].offset - obj_offset;
+	}
+
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+	}
+
+	if (oi->delta_base_sha1) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			const unsigned char *base;
+
+			base = get_delta_base_sha1(p, &w_curs, curpos,
+						   type, obj_offset);
+			if (!base) {
+				type = OBJ_BAD;
+				goto out;
+			}
+
+			hashcpy(oi->delta_base_sha1, base);
+		} else
+			hashclr(oi->delta_base_sha1);
+	}
+
+out:
+	unuse_pack(&w_curs);
+	return type;
+}
+
+static void *unpack_compressed_entry(struct packed_git *p,
+				    struct pack_window **w_curs,
+				    off_t curpos,
+				    unsigned long size)
+{
+	int st;
+	git_zstream stream;
+	unsigned char *buffer, *in;
+
+	buffer = xmallocz_gently(size);
+	if (!buffer)
+		return NULL;
+	memset(&stream, 0, sizeof(stream));
+	stream.next_out = buffer;
+	stream.avail_out = size + 1;
+
+	git_inflate_init(&stream);
+	do {
+		in = use_pack(p, w_curs, curpos, &stream.avail_in);
+		stream.next_in = in;
+		st = git_inflate(&stream, Z_FINISH);
+		if (!stream.avail_out)
+			break; /* the payload is larger than it should be */
+		curpos += stream.next_in - in;
+	} while (st == Z_OK || st == Z_BUF_ERROR);
+	git_inflate_end(&stream);
+	if ((st != Z_STREAM_END) || stream.total_out != size) {
+		free(buffer);
+		return NULL;
+	}
+
+	return buffer;
+}
+
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size);
 
-- 
2.13.1.611.g7e3b11ae1-goog


* [PATCH v4 5/8] sha1_file: refactor read_object
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (18 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 4/8] sha1_file: move delta base cache code up Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-21 17:58   ` Junio C Hamano
  2017-06-20  1:03 ` [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
                   ` (11 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

read_object() and sha1_object_info_extended() both implement mechanisms
such as object replacement, retrying the packed store after failing to
find the object in the packed store then the loose store, and being able
to mark a packed object as bad and then retrying the whole process.
Consolidating these mechanisms would be a great help to maintainability.

Therefore, consolidate them by extending sha1_object_info_extended() to
support the functionality needed, and then modifying read_object() to
use sha1_object_info_extended().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
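A note on the "size_scratch" idiom used in sha1_loose_object_info()
below: when the caller asks for the content but not the size, the size
is still needed internally, so the request struct is temporarily pointed
at a local variable and restored before returning. The standalone sketch
here shows only that idiom; the struct, the function name and the
string payload are hypothetical and not part of the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-field request struct, for illustration only. */
struct info_request {
	unsigned long *sizep;	/* optional: where to store the size */
	void **contentp;	/* optional: where to store a copy of the data */
};

static int fill_info(const char *payload, struct info_request *req)
{
	unsigned long size_scratch;
	int borrowed = 0;

	/* The size is needed below even if the caller did not ask for it. */
	if (!req->sizep) {
		req->sizep = &size_scratch;
		borrowed = 1;
	}

	*req->sizep = strlen(payload);
	if (req->contentp) {
		*req->contentp = malloc(*req->sizep + 1);
		if (*req->contentp)
			memcpy(*req->contentp, payload, *req->sizep + 1);
	}

	/* Never let a pointer to our stack escape to the caller. */
	if (borrowed)
		req->sizep = NULL;

	return (req->contentp && !*req->contentp) ? -1 : 0;
}
```

The restore step matters because the request struct outlives the call;
leaving it pointing at a dead stack slot would be a latent bug.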
 cache.h     |  1 +
 sha1_file.c | 84 ++++++++++++++++++++++++++++++-------------------------------
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index a3631b237..48aea923b 100644
--- a/cache.h
+++ b/cache.h
@@ -1827,6 +1827,7 @@ struct object_info {
 	off_t *disk_sizep;
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
+	void **contentp;
 
 	/* Response */
 	enum {
diff --git a/sha1_file.c b/sha1_file.c
index a7be45efe..4d5033c48 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-	int ret;
-	git_zstream stream;
-	char hdr[8192];
-
-	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-		return NULL;
-
-	return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
 				  struct pack_window **w_curs,
 			          off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	if (!ent)
 		return unpack_entry(p, base_offset, type, base_size);
 
-	*type = ent->type;
-	*base_size = ent->size;
+	if (type)
+		*type = ent->type;
+	if (base_size)
+		*base_size = ent->size;
 	return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 * We always get the representation type, but only convert it to
 	 * a "real" type later if the caller is interested.
 	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (oi->contentp) {
+		*oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+						      &type);
+		if (!*oi->contentp)
+			type = OBJ_BAD;
+	} else {
+		type = unpack_object_header(p, &w_curs, &curpos, &size);
+	}
 
-	if (oi->sizep) {
+	if (!oi->contentp && oi->sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 		free(external_base);
 	}
 
-	*final_type = type;
-	*final_size = size;
+	if (final_type)
+		*final_type = type;
+	if (final_size)
+		*final_size = size;
 
 	unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	git_zstream stream;
 	char hdr[32];
 	struct strbuf hdrbuf = STRBUF_INIT;
+	unsigned long size_scratch;
 
 	if (oi->delta_base_sha1)
 		hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * return value implicitly indicates whether the
 	 * object even exists.
 	 */
-	if (!oi->typep && !oi->typename && !oi->sizep) {
+	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
 		if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
+
+	if (!oi->sizep)
+		oi->sizep = &size_scratch;
+
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
 	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
@@ -2956,10 +2959,18 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 				       sha1_to_hex(sha1));
 	} else if ((status = parse_sha1_header_extended(hdr, oi, flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
-	git_inflate_end(&stream);
+
+	if (status >= 0 && oi->contentp)
+		*oi->contentp = unpack_sha1_rest(&stream, hdr,
+						 *oi->sizep, sha1);
+	else
+		git_inflate_end(&stream);
+
 	munmap(map, mapsize);
 	if (status && oi->typep)
 		*oi->typep = status;
+	if (oi->sizep == &size_scratch)
+		oi->sizep = NULL;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
@@ -2985,6 +2996,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			hashclr(oi->delta_base_sha1);
 		if (oi->typename)
 			strbuf_addstr(oi->typename, typename(co->type));
+		if (oi->contentp)
+			*oi->contentp = xmemdupz(co->buf, co->size);
 		oi->whence = OI_CACHED;
 		return 0;
 	}
@@ -3077,28 +3090,15 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size)
 {
-	unsigned long mapsize;
-	void *map, *buf;
-	struct cached_object *co;
-
-	co = find_cached_object(sha1);
-	if (co) {
-		*type = co->type;
-		*size = co->size;
-		return xmemdupz(co->buf, co->size);
-	}
+	struct object_info oi = OBJECT_INFO_INIT;
+	void *content;
+	oi.typep = type;
+	oi.sizep = size;
+	oi.contentp = &content;
 
-	buf = read_packed_sha1(sha1, type, size);
-	if (buf)
-		return buf;
-	map = map_sha1_file(sha1, &mapsize);
-	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
-		munmap(map, mapsize);
-		return buf;
-	}
-	reprepare_packed_git();
-	return read_packed_sha1(sha1, type, size);
+	if (sha1_object_info_extended(sha1, &oi, 0))
+		return NULL;
+	return content;
 }
 
 /*
-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (19 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 5/8] sha1_file: refactor read_object Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-24 12:45   ` Jeff King
  2017-06-20  1:03 ` [PATCH v4 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

Improve sha1_object_info_extended() by supporting additional flags. This
allows has_sha1_file_with_flags() to be modified to use
sha1_object_info_extended() in a subsequent patch.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |  4 ++++
 sha1_file.c | 43 ++++++++++++++++++++++++-------------------
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/cache.h b/cache.h
index 48aea923b..7cf2ca466 100644
--- a/cache.h
+++ b/cache.h
@@ -1863,6 +1863,10 @@ struct object_info {
 #define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
+/* Do not check cached storage */
+#define OBJECT_INFO_SKIP_CACHED 4
+/* Do not retry packed storage after checking packed and loose storage */
+#define OBJECT_INFO_QUICK 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index 4d5033c48..24f7a146e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,29 +2977,30 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
-	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
 
-	co = find_cached_object(real);
-	if (co) {
-		if (oi->typep)
-			*(oi->typep) = co->type;
-		if (oi->sizep)
-			*(oi->sizep) = co->size;
-		if (oi->disk_sizep)
-			*(oi->disk_sizep) = 0;
-		if (oi->delta_base_sha1)
-			hashclr(oi->delta_base_sha1);
-		if (oi->typename)
-			strbuf_addstr(oi->typename, typename(co->type));
-		if (oi->contentp)
-			*oi->contentp = xmemdupz(co->buf, co->size);
-		oi->whence = OI_CACHED;
-		return 0;
+	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
+		struct cached_object *co = find_cached_object(real);
+		if (co) {
+			if (oi->typep)
+				*(oi->typep) = co->type;
+			if (oi->sizep)
+				*(oi->sizep) = co->size;
+			if (oi->disk_sizep)
+				*(oi->disk_sizep) = 0;
+			if (oi->delta_base_sha1)
+				hashclr(oi->delta_base_sha1);
+			if (oi->typename)
+				strbuf_addstr(oi->typename, typename(co->type));
+			if (oi->contentp)
+				*oi->contentp = xmemdupz(co->buf, co->size);
+			oi->whence = OI_CACHED;
+			return 0;
+		}
 	}
 
 	if (!find_pack_entry(real, &e)) {
@@ -3010,9 +3011,13 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 
 		/* Not a loose object; someone else may have just packed it. */
-		reprepare_packed_git();
-		if (!find_pack_entry(real, &e))
+		if (flags & OBJECT_INFO_QUICK) {
 			return -1;
+		} else {
+			reprepare_packed_git();
+			if (!find_pack_entry(real, &e))
+				return -1;
+		}
 	}
 
 	rtype = packed_object_info(e.p, e.offset, oi);
-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (20 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-21 18:15   ` Junio C Hamano
  2017-06-20  1:03 ` [PATCH v4 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
                   ` (9 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

Add an option to struct object_info to suppress population of additional
information about a packed object if unneeded. This allows an
optimization in which sha1_object_info_extended() does not even need to
access the pack if no information besides provenance is requested. A
subsequent patch will make use of this optimization.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |  1 +
 sha1_file.c | 17 +++++++++++++----
 streaming.c |  1 +
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 7cf2ca466..2e1cc3fe2 100644
--- a/cache.h
+++ b/cache.h
@@ -1828,6 +1828,7 @@ struct object_info {
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
 	void **contentp;
+	unsigned populate_u : 1;
 
 	/* Response */
 	enum {
diff --git a/sha1_file.c b/sha1_file.c
index 24f7a146e..68e3a3400 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3020,6 +3020,13 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
+	if (!oi->typep && !oi->sizep && !oi->disk_sizep &&
+	    !oi->delta_base_sha1 && !oi->typename && !oi->contentp &&
+	    !oi->populate_u) {
+		oi->whence = OI_PACKED;
+		return 0;
+	}
+
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
@@ -3028,10 +3035,12 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->whence = OI_DBCACHED;
 	} else {
 		oi->whence = OI_PACKED;
-		oi->u.packed.offset = e.offset;
-		oi->u.packed.pack = e.p;
-		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
-					 rtype == OBJ_OFS_DELTA);
+		if (oi->populate_u) {
+			oi->u.packed.offset = e.offset;
+			oi->u.packed.pack = e.p;
+			oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
+						 rtype == OBJ_OFS_DELTA);
+		}
 	}
 
 	return 0;
diff --git a/streaming.c b/streaming.c
index 9afa66b8b..deebc18a8 100644
--- a/streaming.c
+++ b/streaming.c
@@ -113,6 +113,7 @@ static enum input_source istream_source(const unsigned char *sha1,
 
 	oi->typep = type;
 	oi->sizep = &size;
+	oi->populate_u = 1;
 	status = sha1_object_info_extended(sha1, oi, 0);
 	if (status < 0)
 		return stream_error;
-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v4 8/8] sha1_file: refactor has_sha1_file_with_flags
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (21 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
@ 2017-06-20  1:03 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 0/8] Improvements to sha1_file Jonathan Tan
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-20  1:03 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff

has_sha1_file_with_flags() implements many mechanisms in common with
sha1_object_info_extended(). Make has_sha1_file_with_flags() a
convenience function for sha1_object_info_extended() instead.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fetch.c      | 10 ++++++----
 builtin/index-pack.c |  3 ++-
 cache.h              | 11 +++--------
 sha1_file.c          | 13 +++----------
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 47708451b..96d5146c4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -242,9 +242,11 @@ static void find_non_local_tags(struct transport *transport,
 		 */
 		if (ends_with(ref->name, "^{}")) {
 			if (item &&
-			    !has_object_file_with_flags(&ref->old_oid, HAS_SHA1_QUICK) &&
+			    !has_object_file_with_flags(&ref->old_oid,
+							OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, ref->old_oid.hash) &&
-			    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+			    !has_sha1_file_with_flags(item->util,
+						      OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, item->util))
 				item->util = NULL;
 			item = NULL;
@@ -258,7 +260,7 @@ static void find_non_local_tags(struct transport *transport,
 		 * fetch.
 		 */
 		if (item &&
-		    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+		    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
 		    !will_fetch(head, item->util))
 			item->util = NULL;
 
@@ -279,7 +281,7 @@ static void find_non_local_tags(struct transport *transport,
 	 * checked to see if it needs fetching.
 	 */
 	if (item &&
-	    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+	    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
 	    !will_fetch(head, item->util))
 		item->util = NULL;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 04b9dcaf0..587bc80c9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -794,7 +794,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed = has_sha1_file_with_flags(oid->hash, HAS_SHA1_QUICK);
+		collision_test_needed =
+			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
 		read_unlock();
 	}
 
diff --git a/cache.h b/cache.h
index 2e1cc3fe2..387694b25 100644
--- a/cache.h
+++ b/cache.h
@@ -1268,15 +1268,10 @@ int read_loose_object(const char *path,
 		      void **contents);
 
 /*
- * Return true iff we have an object named sha1, whether local or in
- * an alternate object database, and whether packed or loose.  This
- * function does not respect replace references.
- *
- * If the QUICK flag is set, do not re-check the pack directory
- * when we cannot find the object (this means we may give a false
- * negative answer if another process is simultaneously repacking).
+ * Convenience for sha1_object_info_extended() with a blank struct
+ * object_info. OBJECT_INFO_SKIP_CACHED is automatically set; pass
+ * nonzero flags to also set other flags.
  */
-#define HAS_SHA1_QUICK 0x1
 extern int has_sha1_file_with_flags(const unsigned char *sha1, int flags);
 static inline int has_sha1_file(const unsigned char *sha1)
 {
diff --git a/sha1_file.c b/sha1_file.c
index 68e3a3400..20db9b510 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3491,18 +3491,11 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-	struct pack_entry e;
-
+	static struct object_info blank;
 	if (!startup_info->have_repository)
 		return 0;
-	if (find_pack_entry(sha1, &e))
-		return 1;
-	if (has_loose_object(sha1))
-		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	return !sha1_object_info_extended(sha1, &blank,
+					  flags | OBJECT_INFO_SKIP_CACHED);
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.611.g7e3b11ae1-goog



* Re: [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  2017-06-20  1:03 ` [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
@ 2017-06-21 17:22   ` Junio C Hamano
  2017-06-21 17:34     ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-06-21 17:22 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
> ("sha1_file: support reading from a loose object of unknown type",
> 2015-05-03) in order to support a feature in cat-file subsequently
> introduced in commit 39e4ae3 ("cat-file: teach cat-file a
> '--allow-unknown-type' option", 2015-05-03). Despite its name and
> location in cache.h, this flag is used neither in
> read_sha1_file_extended() nor in any of the lookup functions, but used
> only in sha1_object_info_extended().
>
> Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
> name of the cat-file flag that invokes this feature, and move it closer
> to the declaration of sha1_object_info_extended(). Also add
> documentation for this flag.

All of the above makes sense, but ...

> diff --git a/cache.h b/cache.h
> index 4d92aae0e..e2ec45dfe 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
>  
>  /* object replacement */
>  #define LOOKUP_REPLACE_OBJECT 1
> -#define LOOKUP_UNKNOWN_OBJECT 2
>  extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
>  static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
>  {
> @@ -1866,6 +1865,8 @@ struct object_info {
>   */
>  #define OBJECT_INFO_INIT {NULL}
>  
> +/* Allow reading from a loose object file of unknown/bogus type */
> +#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2

... this contradicts the analysis given, doesn't it?

Does something break if we change this to 1 (perhaps because in some
cases this bit reaches read_sha1_file_extended())?  I doubt it, but
leaving this to still define the bit to 2 makes readers wonder why.



* Re: [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT
  2017-06-20  1:03 ` [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
@ 2017-06-21 17:33   ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-21 17:33 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> The LOOKUP_REPLACE_OBJECT flag controls whether the
> lookup_replace_object() function is invoked by
> sha1_object_info_extended(), read_sha1_file_extended(), and
> lookup_replace_object_extended(), but it is not immediately clear which
> functions accept that flag.
>
> Therefore restrict this flag to only sha1_object_info_extended(),
> renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
> documentation. Update read_sha1_file_extended() to have a boolean
> parameter instead, and delete lookup_replace_object_extended().
>
> parse_sha1_header() also passes this flag to
> parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
> reading from a loose object of unknown type", 2015-05-03), but that has
> had no effect since that commit. Therefore this patch also removes this
> flag from that invocation.

Yay, code reduction ;-).

> -/* object replacement */
> -#define LOOKUP_REPLACE_OBJECT 1
> -extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
> +extern void *read_sha1_file_extended(const unsigned char *sha1,
> +				     enum object_type *type,
> +				     unsigned long *size, int lookup_replace);

In general, I'd hesitate to regress the API from "unsigned flag"
(that is easier to extend) to "int is_foo" (that will either have to
be reverted back to "unsigned flag", or an overlong parameter list
"int is_foo, int is_bar, int is_baz, ...").  

But let's take this as-is and see how it evolves.
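
A minimal sketch of the extensibility concern above, assuming two
hypothetical options. The DEMO_* macros and demo_* functions are
invented for illustration and are not part of git.

```c
#include <assert.h>

/* Hypothetical option bits for illustration; these are not git's definitions. */
#define DEMO_LOOKUP_REPLACE (1u << 0)
#define DEMO_QUICK          (1u << 1)

/* Boolean style: each new option widens the parameter list. */
static int demo_read_bools(int lookup_replace, int quick)
{
	return (lookup_replace ? 1 : 0) | (quick ? 2 : 0);
}

/* Flag-word style: a new option is a new bit and the signature stays put. */
static int demo_read_flags(unsigned flags)
{
	int result = 0;

	if (flags & DEMO_LOOKUP_REPLACE)
		result |= 1;
	if (flags & DEMO_QUICK)
		result |= 2;
	return result;
}

/* Both styles carry the same information for two options. */
static void demo_flags_usage(void)
{
	assert(demo_read_bools(1, 0) == demo_read_flags(DEMO_LOOKUP_REPLACE));
	assert(demo_read_bools(1, 1) ==
	       demo_read_flags(DEMO_LOOKUP_REPLACE | DEMO_QUICK));
}
```

With the flag word, a third option would only need a new #define; the
boolean style would need a new parameter at every call site.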

> @@ -3025,7 +3027,7 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
>  
>  	oi.typep = &type;
>  	oi.sizep = sizep;
> -	if (sha1_object_info_extended(sha1, &oi, LOOKUP_REPLACE_OBJECT) < 0)
> +	if (sha1_object_info_extended(sha1, &oi, OBJECT_INFO_LOOKUP_REPLACE))
>  		return -1;
>  	return type;
>  }

This changes the error behaviour slightly, doesn't it?  Is it
guaranteed that sha1_object_info_extended() will never return
positive non-zero?  Right now it appears that is the case, so
this probably is a justifiable simplification of a caller, but
given the real consumer of the _extended() API in cat-file.c
treats only negative as an error, we should be consistent.  

I'd prefer to see this change _not_ made as part of this patch.
It may be OK to change this place and two callers in cat-file in a
follow-up patch though.
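
A small sketch of why the two error checks are not interchangeable in
general. demo_info() is a hypothetical stand-in, and its positive
return value is an imagined future case; as noted above, the real
function does not appear to return one today.

```c
#include <assert.h>

/*
 * Hypothetical lookup: negative means error, 0 means success. The
 * positive return is an imagined future "success with extra status".
 */
static int demo_info(int key)
{
	if (key < 0)
		return -1;
	return key > 0 ? 1 : 0;
}

/* Treats any nonzero return as failure. */
static int demo_check_nonzero(int key)
{
	return demo_info(key) ? -1 : 0;
}

/* Treats only negative returns as failure. */
static int demo_check_negative(int key)
{
	return demo_info(key) < 0 ? -1 : 0;
}

static void demo_error_usage(void)
{
	/* The two checks agree on the return values seen today... */
	assert(demo_check_nonzero(-5) == demo_check_negative(-5));
	assert(demo_check_nonzero(0) == demo_check_negative(0));
	/* ...and diverge only if a positive return is ever introduced. */
	assert(demo_check_nonzero(7) == -1);
	assert(demo_check_negative(7) == 0);
}
```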

> @@ -3107,13 +3109,14 @@ static void *read_object(const unsigned char *sha1, enum object_type *type,
>  void *read_sha1_file_extended(const unsigned char *sha1,
>  			      enum object_type *type,
>  			      unsigned long *size,
> -			      unsigned flag)
> +			      int lookup_replace)
>  {
>  	void *data;
>  	const struct packed_git *p;
>  	const char *path;
>  	struct stat st;
> -	const unsigned char *repl = lookup_replace_object_extended(sha1, flag);
> +	const unsigned char *repl = lookup_replace ? lookup_replace_object(sha1)
> +						   : sha1;
>  
>  	errno = 0;
>  	data = read_object(repl, type, size);


* Re: [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  2017-06-21 17:22   ` Junio C Hamano
@ 2017-06-21 17:34     ` Jonathan Tan
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-21 17:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Wed, 21 Jun 2017 10:22:38 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
> > ("sha1_file: support reading from a loose object of unknown type",
> > 2015-05-03) in order to support a feature in cat-file subsequently
> > introduced in commit 39e4ae3 ("cat-file: teach cat-file a
> > '--allow-unknown-type' option", 2015-05-03). Despite its name and
> > location in cache.h, this flag is used neither in
> > read_sha1_file_extended() nor in any of the lookup functions, but used
> > only in sha1_object_info_extended().
> >
> > Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
> > name of the cat-file flag that invokes this feature, and move it closer
> > to the declaration of sha1_object_info_extended(). Also add
> > documentation for this flag.
> 
> All of the above makes sense, but ...
> 
> > diff --git a/cache.h b/cache.h
> > index 4d92aae0e..e2ec45dfe 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
> >  
> >  /* object replacement */
> >  #define LOOKUP_REPLACE_OBJECT 1
> > -#define LOOKUP_UNKNOWN_OBJECT 2
> >  extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
> >  static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
> >  {
> > @@ -1866,6 +1865,8 @@ struct object_info {
> >   */
> >  #define OBJECT_INFO_INIT {NULL}
> >  
> > +/* Allow reading from a loose object file of unknown/bogus type */
> > +#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
> 
> ... this contradicts the analysis given, doesn't it?
> 
> Does something break if we change this to 1 (perhaps because in some
> > cases this bit reaches read_sha1_file_extended())?  I doubt it, but
> leaving this to still define the bit to 2 makes readers wonder why.

The issue is that LOOKUP_REPLACE_OBJECT (which is 1) is also used by
sha1_object_info_extended(). So yes, it will break if
OBJECT_INFO_ALLOW_UNKNOWN_TYPE is changed to 1. I'm resolving this in
the next patch by also renaming LOOKUP_REPLACE_OBJECT and making it only
used by sha1_object_info_extended().

I'll add a comment in the commit message locally, and will resend it out
tomorrow (in case more comments come).
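
The collision described above can be sketched as follows. The values
1 and 2 mirror the quoted definitions; the DEMO_* names and helper
functions are hypothetical.

```c
#include <assert.h>

/* Values mirror the ones quoted above; the DEMO_ names are hypothetical. */
#define DEMO_LOOKUP_REPLACE     1u
#define DEMO_ALLOW_UNKNOWN_TYPE 2u

static int demo_wants_replace(unsigned flags)
{
	return (flags & DEMO_LOOKUP_REPLACE) != 0;
}

static int demo_allows_unknown(unsigned flags)
{
	return (flags & DEMO_ALLOW_UNKNOWN_TYPE) != 0;
}

static void demo_bits_usage(void)
{
	/* Distinct bits: setting one option never implies the other. */
	assert(!demo_wants_replace(DEMO_ALLOW_UNKNOWN_TYPE));
	assert(!demo_allows_unknown(DEMO_LOOKUP_REPLACE));
	assert(demo_wants_replace(DEMO_LOOKUP_REPLACE | DEMO_ALLOW_UNKNOWN_TYPE));
}
```

If both macros were defined as 1, a caller asking only for the
unknown-type behavior would also trigger the replace lookup, since
both options travel in the same flags argument.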


* Re: [PATCH v4 5/8] sha1_file: refactor read_object
  2017-06-20  1:03 ` [PATCH v4 5/8] sha1_file: refactor read_object Jonathan Tan
@ 2017-06-21 17:58   ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-21 17:58 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> @@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>  	git_zstream stream;
>  	char hdr[32];
>  	struct strbuf hdrbuf = STRBUF_INIT;
> +	unsigned long size_scratch;
>  
>  	if (oi->delta_base_sha1)
>  		hashclr(oi->delta_base_sha1);
> @@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>  	map = map_sha1_file(sha1, &mapsize);
>  	if (!map)
>  		return -1;
> +
> +	if (!oi->sizep)
> +		oi->sizep = &size_scratch;
> +
>  	if (oi->disk_sizep)
>  		*oi->disk_sizep = mapsize;
>  	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
>  	if (status && oi->typep)
>  		*oi->typep = status;
> +	if (oi->sizep == &size_scratch)
> +		oi->sizep = NULL;

This looked somewhat unusual but nevertheless is correct.  Because
of the way parse_sha1_header_extended() interacts with its callers,
the usual fn(oi->sizep ? oi->sizep : &dummy) pattern does not apply
to this codepath.
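
A rough sketch of the scratch-pointer pattern discussed above, under
the assumption that the callee receives the whole struct rather than
a size pointer. struct demo_info, demo_parse() and demo_need_size()
are hypothetical simplifications, not git code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_info {
	unsigned long *sizep; /* NULL when the caller does not want the size */
};

/* The callee only sees the struct, so a separate &dummy cannot be passed. */
static void demo_parse(struct demo_info *oi, unsigned long parsed)
{
	if (oi->sizep)
		*oi->sizep = parsed;
}

/* Needs the size internally even when the caller did not ask for it. */
static unsigned long demo_need_size(struct demo_info *oi)
{
	unsigned long size_scratch = 0;
	unsigned long result;

	assert(oi != NULL);
	if (!oi->sizep)
		oi->sizep = &size_scratch; /* borrow the field temporarily */
	demo_parse(oi, 42);
	result = *oi->sizep;
	if (oi->sizep == &size_scratch)
		oi->sizep = NULL;          /* restore the caller's view */
	return result;
}
```

The restore step matters because size_scratch lives on the stack; a
pointer to it left in the struct would dangle after the function
returns.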

> @@ -3077,28 +3090,15 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
>  static void *read_object(const unsigned char *sha1, enum object_type *type,
>  			 unsigned long *size)
>  {
> -	unsigned long mapsize;
> -	void *map, *buf;
> -	struct cached_object *co;
> -
> -	co = find_cached_object(sha1);
> -	if (co) {
> -		*type = co->type;
> -		*size = co->size;
> -		return xmemdupz(co->buf, co->size);
> -	}
> +	struct object_info oi = OBJECT_INFO_INIT;
> +	void *content;
> +	oi.typep = type;
> +	oi.sizep = size;
> +	oi.contentp = &content;
>  
> -	buf = read_packed_sha1(sha1, type, size);
> -	if (buf)
> -		return buf;
> -	map = map_sha1_file(sha1, &mapsize);
> -	if (map) {
> -		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
> -		munmap(map, mapsize);
> -		return buf;
> -	}
> -	reprepare_packed_git();
> -	return read_packed_sha1(sha1, type, size);
> +	if (sha1_object_info_extended(sha1, &oi, 0))
> +		return NULL;
> +	return content;
>  }

Nice code reduction; it is somewhat funny to think that a function
meant to gather 'object info' does so much, but we can always say
the contents are part of the information about the object ;-).

Same comment as the other one applies here; the definition of how an
error is reported by sha1_object_info_extended() should be kept
consistent with existing callers.

Thanks.



* Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-20  1:03 ` [PATCH v4 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
@ 2017-06-21 18:15   ` Junio C Hamano
  2017-06-24 12:48     ` Jeff King
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-06-21 18:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Add an option to struct object_info to suppress population of additional
> information about a packed object if unneeded. This allows an
> optimization in which sha1_object_info_extended() does not even need to
> access the pack if no information besides provenance is requested. A
> subsequent patch will make use of this optimization.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>

I think the motivation is sound, but...

> diff --git a/sha1_file.c b/sha1_file.c
> index 24f7a146e..68e3a3400 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3020,6 +3020,13 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>  		}
>  	}
>  
> +	if (!oi->typep && !oi->sizep && !oi->disk_sizep &&
> +	    !oi->delta_base_sha1 && !oi->typename && !oi->contentp &&
> +	    !oi->populate_u) {
> +		oi->whence = OI_PACKED;
> +		return 0;
> +	}
> +

... this "if" statement feels like a maintenance nightmare.  The
intent of the guard, I think, is "when the call wants absolutely
nothing but whence", but the implementation of the guard will not
stay true to the intent whenever somebody adds a new field to oi.

I wonder if it makes more sense to have a new field "whence_only",
which is set only by such a specialized caller, which this guard
checks (and no other fields).
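
To make the maintenance concern concrete, here is a sketch assuming a
struct that later gains a field. The struct layout, the whence_only
bit and both guard functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_info {
	int *typep;
	unsigned long *sizep;
	void **contentp;          /* imagine this field was added later */
	unsigned whence_only : 1; /* hypothetical opt-in bit suggested above */
};

/* Enumerating guard written before contentp existed: now silently stale. */
static int demo_skip_by_fields(const struct demo_info *oi)
{
	return !oi->typep && !oi->sizep;
}

/* Opt-in guard: unaffected by how many fields are added later. */
static int demo_skip_by_bit(const struct demo_info *oi)
{
	return oi->whence_only;
}

static void demo_guard_usage(void)
{
	void *content = NULL;
	struct demo_info oi = { NULL, NULL, &content, 0 };

	assert(demo_skip_by_fields(&oi) == 1); /* wrongly skips the read */
	assert(demo_skip_by_bit(&oi) == 0);    /* correctly does not skip */
}
```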

> diff --git a/streaming.c b/streaming.c
> index 9afa66b8b..deebc18a8 100644
> --- a/streaming.c
> +++ b/streaming.c
> @@ -113,6 +113,7 @@ static enum input_source istream_source(const unsigned char *sha1,
>  
>  	oi->typep = type;
>  	oi->sizep = &size;
> +	oi->populate_u = 1;
>  	status = sha1_object_info_extended(sha1, oi, 0);
>  	if (status < 0)
>  		return stream_error;

By the way, populate_u feels very misnamed.  Even if that union
gains details about other types of representations, your caller that
flips populate_u would not care about them.  This bit is about
learning even more detail about a packed object, so a name with
"packed" somewhere would be more appropriate, I would think.


* Re: [PATCH v4 0/8] Improvements to sha1_file
  2017-06-20  1:03 ` [PATCH v4 0/8] Improvements to sha1_file Jonathan Tan
@ 2017-06-21 18:18   ` Junio C Hamano
  2017-06-24 12:51   ` Jeff King
  1 sibling, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-21 18:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks, Peff and Junio for your comments. Here's an updated version and
> ...
> Jonathan Tan (8):
>   sha1_file: teach packed_object_info about typename
>   sha1_file: rename LOOKUP_UNKNOWN_OBJECT
>   sha1_file: rename LOOKUP_REPLACE_OBJECT
>   sha1_file: move delta base cache code up
>   sha1_file: refactor read_object
>   sha1_file: improve sha1_object_info_extended
>   sha1_file: do not access pack if unneeded
>   sha1_file: refactor has_sha1_file_with_flags

If 3/8 came before 2/8 I wouldn't have been puzzled by the latter,
and I threw comments at a few minor details but overall I didn't see
anything glaringly wrong that requires a major rewrite of the series.

Overall it was a very pleasant read.  Thanks.



* [PATCH v5 0/8] Improvements to sha1_file
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (22 preceding siblings ...)
  2017-06-20  1:03 ` [PATCH v4 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  1:40   ` Junio C Hamano
  2017-06-22  0:40 ` [PATCH v5 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Thanks, Junio. A reply to your comment on patch 7:

> ... this "if" statement feels like a maintenance nightmare.  The
> intent of the guard, I think, is "when the call wants absolutely
> nothing but whence", but the implementation of the guard will not
> stay true to the intent whenever somebody adds a new field to oi.
> 
> I wonder if it makes more sense to have a new field "whence_only",
> which is set only by such a specialized caller, which this guard
> checks (and no other fields).

After some more thought, I think I came up with a better solution -
allow sha1_object_info_extended() to take a NULL struct object_info
pointer, and immediately assign it (if NULL) a blank struct, but use the
NULL-ness as an indication that we can skip accessing the packfile. The
last patch actually doesn't even need the "whence", so we can do this.
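
A minimal sketch of the NULL-pointer idea, under simplifying
assumptions: demo_object_info(), its "found" parameter and the
demo_pack_accesses counter are hypothetical stand-ins for the real
lookup and pack access.

```c
#include <assert.h>
#include <stddef.h>

struct demo_info {
	int *typep;
};

/* Counts how often the (pretend) packfile is actually read. */
static int demo_pack_accesses;

static int demo_object_info(int found, struct demo_info *oi)
{
	struct demo_info blank = { NULL };
	int existence_only = (oi == NULL);

	if (!oi)
		oi = &blank;        /* later code may dereference oi freely */
	if (!found)
		return -1;
	if (existence_only)
		return 0;           /* caller wanted nothing: skip the pack */
	demo_pack_accesses++;
	if (oi->typep)
		*oi->typep = 3;     /* arbitrary placeholder type value */
	return 0;
}

static void demo_null_usage(void)
{
	int type = 0;
	struct demo_info oi = { &type };

	demo_pack_accesses = 0;
	assert(demo_object_info(1, NULL) == 0);
	assert(demo_pack_accesses == 0);
	assert(demo_object_info(1, &oi) == 0);
	assert(demo_pack_accesses == 1 && type == 3);
}
```

Unlike a guard that enumerates fields, the NULL check does not need
updating when struct object_info gains members.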

Changes from v4:
 - patch 2
   - Updated commit message to explain why
     OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to be 2, not 1.
 - patch 3
   - Made all invocations of sha1_object_info_extended() compare "< 0".
 - patch 5
   - Made all invocations of sha1_object_info_extended() compare "< 0".
 - patch 7
   - Rewrote patch to make sha1_object_info_extended() accept NULL
     struct object_info pointer.
 - patch 8
   - Made has_sha1_file_with_flags send NULL instead of blank struct
     object_info.

Jonathan Tan (8):
  sha1_file: teach packed_object_info about typename
  sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  sha1_file: rename LOOKUP_REPLACE_OBJECT
  sha1_file: move delta base cache code up
  sha1_file: refactor read_object
  sha1_file: improve sha1_object_info_extended
  sha1_file: do not access pack if unneeded
  sha1_file: refactor has_sha1_file_with_flags

 builtin/cat-file.c   |   7 +-
 builtin/fetch.c      |  10 +-
 builtin/index-pack.c |   3 +-
 cache.h              |  36 +++--
 sha1_file.c          | 385 ++++++++++++++++++++++++++-------------------------
 5 files changed, 224 insertions(+), 217 deletions(-)

-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v5 1/8] sha1_file: teach packed_object_info about typename
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (23 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 0/8] Improvements to sha1_file Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-03), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() to fill in "typename" itself.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed..a52b27541 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2277,9 +2277,18 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (oi->typep) {
-		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*oi->typep < 0) {
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2960,7 +2969,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	enum object_type real_type;
 	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
 
 	co = find_cached_object(real);
@@ -2992,18 +3000,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			return -1;
 	}
 
-	/*
-	 * packed_object_info() does not follow the delta chain to
-	 * find out the real type, unless it is given oi->typep.
-	 */
-	if (oi->typename && !oi->typep)
-		oi->typep = &real_type;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-		if (oi->typep == &real_type)
-			oi->typep = NULL;
 		return sha1_object_info_extended(real, oi, 0);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
@@ -3014,10 +3013,6 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
 					 rtype == OBJ_OFS_DELTA);
 	}
-	if (oi->typename)
-		strbuf_addstr(oi->typename, typename(*oi->typep));
-	if (oi->typep == &real_type)
-		oi->typep = NULL;
 
 	return 0;
 }
-- 
2.13.1.611.g7e3b11ae1-goog
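
The approach described in patch 1/8 (resolve the type once, then fill in
whichever of the numeric type and the type name the caller requested) can
be sketched as below. This is only an illustration under stated
assumptions: every identifier (sketch_type, sketch_request,
sketch_type_name, sketch_fill_type) is hypothetical and does not exist in
git, and a fixed-size char buffer stands in for the strbuf used by the
real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not git's real types. */
enum sketch_type { SK_BAD = -1, SK_COMMIT = 1, SK_TREE = 2, SK_BLOB = 3 };

struct sketch_request {
	enum sketch_type *typep; /* optional: receives the numeric type */
	char *name;              /* optional: receives the type name */
	size_t name_cap;         /* capacity of name, including the NUL */
};

static const char *sketch_type_name(enum sketch_type t)
{
	switch (t) {
	case SK_COMMIT: return "commit";
	case SK_TREE:   return "tree";
	case SK_BLOB:   return "blob";
	default:        return NULL; /* no name for an invalid type */
	}
}

/*
 * "resolved" plays the role of the single type resolution; both
 * optional outputs are filled from it. Returns -1 for a bad type.
 */
static int sketch_fill_type(enum sketch_type resolved,
			    struct sketch_request *req)
{
	assert(req != NULL);
	if (!req->typep && !req->name)
		return 0; /* caller asked for neither output */
	if (req->typep)
		*req->typep = resolved;
	if (req->name && req->name_cap > 0) {
		const char *tn = sketch_type_name(resolved);
		/* An invalid type leaves the name empty. */
		snprintf(req->name, req->name_cap, "%s", tn ? tn : "");
	}
	return resolved < 0 ? -1 : 0;
}
```

A caller that wants only the name leaves typep NULL; no temporary pointer
has to be patched in and removed again, which is the bookkeeping the
patch deletes from sha1_object_info_extended().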



* [PATCH v5 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (24 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
("sha1_file: support reading from a loose object of unknown type",
2015-05-03) in order to support a feature in cat-file subsequently
introduced in commit 39e4ae3 ("cat-file: teach cat-file a
'--allow-unknown-type' option", 2015-05-03). Despite its name and
location in cache.h, this flag is used neither in
read_sha1_file_extended() nor in any of the lookup functions, but only
in sha1_object_info_extended().

Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
name of the cat-file flag that invokes this feature, and move it closer
to the declaration of sha1_object_info_extended(). Also add
documentation for this flag.

OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to 2, not 1, to avoid
conflicting with LOOKUP_REPLACE_OBJECT. Avoidance of this conflict is
necessary because sha1_object_info_extended() supports both flags.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c | 2 +-
 cache.h            | 3 ++-
 sha1_file.c        | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4bffd7a2d..209374b3c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -60,7 +60,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	const char *path = force_path;
 
 	if (unknown_type)
-		flags |= LOOKUP_UNKNOWN_OBJECT;
+		flags |= OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
 
 	if (get_sha1_with_context(obj_name, GET_SHA1_RECORD_PATH,
 				  oid.hash, &obj_context))
diff --git a/cache.h b/cache.h
index 4d92aae0e..e2ec45dfe 100644
--- a/cache.h
+++ b/cache.h
@@ -1207,7 +1207,6 @@ extern char *xdg_cache_home(const char *filename);
 
 /* object replacement */
 #define LOOKUP_REPLACE_OBJECT 1
-#define LOOKUP_UNKNOWN_OBJECT 2
 extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
 {
@@ -1866,6 +1865,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Allow reading from a loose object file of unknown/bogus type */
+#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index a52b27541..ad04ea8e0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1964,7 +1964,7 @@ static int parse_sha1_header_extended(const char *hdr, struct object_info *oi,
 	 * we're obtaining the type using '--allow-unknown-type'
 	 * option.
 	 */
-	if ((flags & LOOKUP_UNKNOWN_OBJECT) && (type < 0))
+	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE) && (type < 0))
 		type = 0;
 	else if (type < 0)
 		die("invalid object type");
@@ -2941,7 +2941,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 		return -1;
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
-	if ((flags & LOOKUP_UNKNOWN_OBJECT)) {
+	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
 		if (unpack_sha1_header_to_strbuf(&stream, map, mapsize, hdr, sizeof(hdr), &hdrbuf) < 0)
 			status = error("unable to unpack %s header with --allow-unknown-type",
 				       sha1_to_hex(sha1));
-- 
2.13.1.611.g7e3b11ae1-goog
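
The remark in patch 2/8 about defining the flag as 2 rather than 1 can be
sketched as below: two flags that share one "unsigned flags" argument must
occupy distinct bits so that each can be tested independently. The macro
and function names are hypothetical (only the values 1 and 2 come from the
patch), and returning -1 stands in for the die() call in the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the values mirror the two flags in the patch. */
#define SKETCH_LOOKUP_REPLACE     1u
#define SKETCH_ALLOW_UNKNOWN_TYPE 2u

/* Pick the replacement id only when the caller asked for it. */
static const char *sketch_pick_id(const char *id, const char *replacement,
				  unsigned flags)
{
	assert(id != NULL);
	if ((flags & SKETCH_LOOKUP_REPLACE) && replacement)
		return replacement;
	return id;
}

/*
 * Returns the type to report, or -1 where the real code would die().
 * A negative parsed_type stands for an unrecognized type name.
 */
static int sketch_check_type(int parsed_type, unsigned flags)
{
	if (parsed_type >= 0)
		return parsed_type;
	if (flags & SKETCH_ALLOW_UNKNOWN_TYPE)
		return 0; /* tolerated: reported as type 0 */
	return -1;
}
```

If both macros were 1, a caller asking only for replacement lookup would
also silently accept unknown types; with distinct bits the two checks
stay independent.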



* [PATCH v5 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (25 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 4/8] sha1_file: move delta base cache code up Jonathan Tan
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

The LOOKUP_REPLACE_OBJECT flag controls whether the
lookup_replace_object() function is invoked by
sha1_object_info_extended(), read_sha1_file_extended(), and
lookup_replace_object_extended(), but it is not immediately clear which
functions accept that flag.

Therefore restrict this flag to only sha1_object_info_extended(),
renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
documentation. Update read_sha1_file_extended() to have a boolean
parameter instead, and delete lookup_replace_object_extended().

parse_sha1_header() also passes this flag to
parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
reading from a loose object of unknown type", 2015-05-03), but that has
had no effect since that commit. Therefore this patch also removes this
flag from that invocation.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c |  5 +++--
 cache.h            | 17 ++++++-----------
 sha1_file.c        | 14 +++++++++-----
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 209374b3c..a58b8c820 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -56,7 +56,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
 	struct object_context obj_context;
 	struct object_info oi = OBJECT_INFO_INIT;
 	struct strbuf sb = STRBUF_INIT;
-	unsigned flags = LOOKUP_REPLACE_OBJECT;
+	unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
 	const char *path = force_path;
 
 	if (unknown_type)
@@ -337,7 +337,8 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 	struct strbuf buf = STRBUF_INIT;
 
 	if (!data->skip_object_info &&
-	    sha1_object_info_extended(data->oid.hash, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
+	    sha1_object_info_extended(data->oid.hash, &data->info,
+				      OBJECT_INFO_LOOKUP_REPLACE) < 0) {
 		printf("%s missing\n",
 		       obj_name ? obj_name : oid_to_hex(&data->oid));
 		fflush(stdout);
diff --git a/cache.h b/cache.h
index e2ec45dfe..a3631b237 100644
--- a/cache.h
+++ b/cache.h
@@ -1205,12 +1205,12 @@ extern char *xdg_config_home(const char *filename);
  */
 extern char *xdg_cache_home(const char *filename);
 
-/* object replacement */
-#define LOOKUP_REPLACE_OBJECT 1
-extern void *read_sha1_file_extended(const unsigned char *sha1, enum object_type *type, unsigned long *size, unsigned flag);
+extern void *read_sha1_file_extended(const unsigned char *sha1,
+				     enum object_type *type,
+				     unsigned long *size, int lookup_replace);
 static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
 {
-	return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
+	return read_sha1_file_extended(sha1, type, size, 1);
 }
 
 /*
@@ -1232,13 +1232,6 @@ static inline const unsigned char *lookup_replace_object(const unsigned char *sh
 	return do_lookup_replace_object(sha1);
 }
 
-static inline const unsigned char *lookup_replace_object_extended(const unsigned char *sha1, unsigned flag)
-{
-	if (!(flag & LOOKUP_REPLACE_OBJECT))
-		return sha1;
-	return lookup_replace_object(sha1);
-}
-
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
 extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
@@ -1865,6 +1858,8 @@ struct object_info {
  */
 #define OBJECT_INFO_INIT {NULL}
 
+/* Invoke lookup_replace_object() on the given hash */
+#define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
diff --git a/sha1_file.c b/sha1_file.c
index ad04ea8e0..71296e6cd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2002,7 +2002,7 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	struct object_info oi = OBJECT_INFO_INIT;
 
 	oi.sizep = sizep;
-	return parse_sha1_header_extended(hdr, &oi, LOOKUP_REPLACE_OBJECT);
+	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
@@ -2969,7 +2969,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
-	const unsigned char *real = lookup_replace_object_extended(sha1, flags);
+	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
+				    lookup_replace_object(sha1) :
+				    sha1;
 
 	co = find_cached_object(real);
 	if (co) {
@@ -3025,7 +3027,8 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 
 	oi.typep = &type;
 	oi.sizep = sizep;
-	if (sha1_object_info_extended(sha1, &oi, LOOKUP_REPLACE_OBJECT) < 0)
+	if (sha1_object_info_extended(sha1, &oi,
+				      OBJECT_INFO_LOOKUP_REPLACE) < 0)
 		return -1;
 	return type;
 }
@@ -3107,13 +3110,14 @@ static void *read_object(const unsigned char *sha1, enum object_type *type,
 void *read_sha1_file_extended(const unsigned char *sha1,
 			      enum object_type *type,
 			      unsigned long *size,
-			      unsigned flag)
+			      int lookup_replace)
 {
 	void *data;
 	const struct packed_git *p;
 	const char *path;
 	struct stat st;
-	const unsigned char *repl = lookup_replace_object_extended(sha1, flag);
+	const unsigned char *repl = lookup_replace ? lookup_replace_object(sha1)
+						   : sha1;
 
 	errno = 0;
 	data = read_object(repl, type, size);
-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v5 4/8] sha1_file: move delta base cache code up
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (26 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 5/8] sha1_file: refactor read_object Jonathan Tan
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 220 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 110 insertions(+), 110 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 71296e6cd..0c996370d 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2239,116 +2239,6 @@ static enum object_type packed_to_object_type(struct packed_git *p,
 	goto out;
 }
 
-int packed_object_info(struct packed_git *p, off_t obj_offset,
-		       struct object_info *oi)
-{
-	struct pack_window *w_curs = NULL;
-	unsigned long size;
-	off_t curpos = obj_offset;
-	enum object_type type;
-
-	/*
-	 * We always get the representation type, but only convert it to
-	 * a "real" type later if the caller is interested.
-	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-	if (oi->sizep) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			off_t tmp_pos = curpos;
-			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-							   type, obj_offset);
-			if (!base_offset) {
-				type = OBJ_BAD;
-				goto out;
-			}
-			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*oi->sizep == 0) {
-				type = OBJ_BAD;
-				goto out;
-			}
-		} else {
-			*oi->sizep = size;
-		}
-	}
-
-	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
-	}
-
-	if (oi->typep || oi->typename) {
-		enum object_type ptot;
-		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
-					     curpos);
-		if (oi->typep)
-			*oi->typep = ptot;
-		if (oi->typename) {
-			const char *tn = typename(ptot);
-			if (tn)
-				strbuf_addstr(oi->typename, tn);
-		}
-		if (ptot < 0) {
-			type = OBJ_BAD;
-			goto out;
-		}
-	}
-
-	if (oi->delta_base_sha1) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			const unsigned char *base;
-
-			base = get_delta_base_sha1(p, &w_curs, curpos,
-						   type, obj_offset);
-			if (!base) {
-				type = OBJ_BAD;
-				goto out;
-			}
-
-			hashcpy(oi->delta_base_sha1, base);
-		} else
-			hashclr(oi->delta_base_sha1);
-	}
-
-out:
-	unuse_pack(&w_curs);
-	return type;
-}
-
-static void *unpack_compressed_entry(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t curpos,
-				    unsigned long size)
-{
-	int st;
-	git_zstream stream;
-	unsigned char *buffer, *in;
-
-	buffer = xmallocz_gently(size);
-	if (!buffer)
-		return NULL;
-	memset(&stream, 0, sizeof(stream));
-	stream.next_out = buffer;
-	stream.avail_out = size + 1;
-
-	git_inflate_init(&stream);
-	do {
-		in = use_pack(p, w_curs, curpos, &stream.avail_in);
-		stream.next_in = in;
-		st = git_inflate(&stream, Z_FINISH);
-		if (!stream.avail_out)
-			break; /* the payload is larger than it should be */
-		curpos += stream.next_in - in;
-	} while (st == Z_OK || st == Z_BUF_ERROR);
-	git_inflate_end(&stream);
-	if ((st != Z_STREAM_END) || stream.total_out != size) {
-		free(buffer);
-		return NULL;
-	}
-
-	return buffer;
-}
-
 static struct hashmap delta_base_cache;
 static size_t delta_base_cached;
 
@@ -2486,6 +2376,116 @@ static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
 	hashmap_add(&delta_base_cache, ent);
 }
 
+int packed_object_info(struct packed_git *p, off_t obj_offset,
+		       struct object_info *oi)
+{
+	struct pack_window *w_curs = NULL;
+	unsigned long size;
+	off_t curpos = obj_offset;
+	enum object_type type;
+
+	/*
+	 * We always get the representation type, but only convert it to
+	 * a "real" type later if the caller is interested.
+	 */
+	type = unpack_object_header(p, &w_curs, &curpos, &size);
+
+	if (oi->sizep) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			off_t tmp_pos = curpos;
+			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
+							   type, obj_offset);
+			if (!base_offset) {
+				type = OBJ_BAD;
+				goto out;
+			}
+			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*oi->sizep == 0) {
+				type = OBJ_BAD;
+				goto out;
+			}
+		} else {
+			*oi->sizep = size;
+		}
+	}
+
+	if (oi->disk_sizep) {
+		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
+		*oi->disk_sizep = revidx[1].offset - obj_offset;
+	}
+
+	if (oi->typep || oi->typename) {
+		enum object_type ptot;
+		ptot = packed_to_object_type(p, obj_offset, type, &w_curs,
+					     curpos);
+		if (oi->typep)
+			*oi->typep = ptot;
+		if (oi->typename) {
+			const char *tn = typename(ptot);
+			if (tn)
+				strbuf_addstr(oi->typename, tn);
+		}
+		if (ptot < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+	}
+
+	if (oi->delta_base_sha1) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			const unsigned char *base;
+
+			base = get_delta_base_sha1(p, &w_curs, curpos,
+						   type, obj_offset);
+			if (!base) {
+				type = OBJ_BAD;
+				goto out;
+			}
+
+			hashcpy(oi->delta_base_sha1, base);
+		} else
+			hashclr(oi->delta_base_sha1);
+	}
+
+out:
+	unuse_pack(&w_curs);
+	return type;
+}
+
+static void *unpack_compressed_entry(struct packed_git *p,
+				    struct pack_window **w_curs,
+				    off_t curpos,
+				    unsigned long size)
+{
+	int st;
+	git_zstream stream;
+	unsigned char *buffer, *in;
+
+	buffer = xmallocz_gently(size);
+	if (!buffer)
+		return NULL;
+	memset(&stream, 0, sizeof(stream));
+	stream.next_out = buffer;
+	stream.avail_out = size + 1;
+
+	git_inflate_init(&stream);
+	do {
+		in = use_pack(p, w_curs, curpos, &stream.avail_in);
+		stream.next_in = in;
+		st = git_inflate(&stream, Z_FINISH);
+		if (!stream.avail_out)
+			break; /* the payload is larger than it should be */
+		curpos += stream.next_in - in;
+	} while (st == Z_OK || st == Z_BUF_ERROR);
+	git_inflate_end(&stream);
+	if ((st != Z_STREAM_END) || stream.total_out != size) {
+		free(buffer);
+		return NULL;
+	}
+
+	return buffer;
+}
+
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size);
 
-- 
2.13.1.611.g7e3b11ae1-goog



* [PATCH v5 5/8] sha1_file: refactor read_object
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (27 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 4/8] sha1_file: move delta base cache code up Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

read_object() and sha1_object_info_extended() both implement mechanisms
such as object replacement, retrying the packed store after failing to
find the object first in the packed store and then in the loose store,
and marking a packed object as bad before retrying the whole process.
Consolidating these mechanisms would be a great help to maintainability.

Therefore, consolidate them by extending sha1_object_info_extended() to
support the functionality needed, and then modifying read_object() to
use sha1_object_info_extended().

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |  1 +
 sha1_file.c | 84 ++++++++++++++++++++++++++++++-------------------------------
 2 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/cache.h b/cache.h
index a3631b237..48aea923b 100644
--- a/cache.h
+++ b/cache.h
@@ -1827,6 +1827,7 @@ struct object_info {
 	off_t *disk_sizep;
 	unsigned char *delta_base_sha1;
 	struct strbuf *typename;
+	void **contentp;
 
 	/* Response */
 	enum {
diff --git a/sha1_file.c b/sha1_file.c
index 0c996370d..615a27dac 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2005,19 +2005,6 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
-static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
-{
-	int ret;
-	git_zstream stream;
-	char hdr[8192];
-
-	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
-	if (ret < Z_OK || (*type = parse_sha1_header(hdr, size)) < 0)
-		return NULL;
-
-	return unpack_sha1_rest(&stream, hdr, *size, sha1);
-}
-
 unsigned long get_size_from_delta(struct packed_git *p,
 				  struct pack_window **w_curs,
 			          off_t curpos)
@@ -2326,8 +2313,10 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	if (!ent)
 		return unpack_entry(p, base_offset, type, base_size);
 
-	*type = ent->type;
-	*base_size = ent->size;
+	if (type)
+		*type = ent->type;
+	if (base_size)
+		*base_size = ent->size;
 	return xmemdupz(ent->data, ent->size);
 }
 
@@ -2388,9 +2377,16 @@ int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 * We always get the representation type, but only convert it to
 	 * a "real" type later if the caller is interested.
 	 */
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
+	if (oi->contentp) {
+		*oi->contentp = cache_or_unpack_entry(p, obj_offset, oi->sizep,
+						      &type);
+		if (!*oi->contentp)
+			type = OBJ_BAD;
+	} else {
+		type = unpack_object_header(p, &w_curs, &curpos, &size);
+	}
 
-	if (oi->sizep) {
+	if (!oi->contentp && oi->sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -2679,8 +2675,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 		free(external_base);
 	}
 
-	*final_type = type;
-	*final_size = size;
+	if (final_type)
+		*final_type = type;
+	if (final_size)
+		*final_size = size;
 
 	unuse_pack(&w_curs);
 
@@ -2914,6 +2912,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	git_zstream stream;
 	char hdr[32];
 	struct strbuf hdrbuf = STRBUF_INIT;
+	unsigned long size_scratch;
 
 	if (oi->delta_base_sha1)
 		hashclr(oi->delta_base_sha1);
@@ -2926,7 +2925,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * return value implicitly indicates whether the
 	 * object even exists.
 	 */
-	if (!oi->typep && !oi->typename && !oi->sizep) {
+	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
 		if (stat_sha1_file(sha1, &st, &path) < 0)
@@ -2939,6 +2938,10 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
+
+	if (!oi->sizep)
+		oi->sizep = &size_scratch;
+
 	if (oi->disk_sizep)
 		*oi->disk_sizep = mapsize;
 	if ((flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE)) {
@@ -2956,10 +2959,18 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 				       sha1_to_hex(sha1));
 	} else if ((status = parse_sha1_header_extended(hdr, oi, flags)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
-	git_inflate_end(&stream);
+
+	if (status >= 0 && oi->contentp)
+		*oi->contentp = unpack_sha1_rest(&stream, hdr,
+						 *oi->sizep, sha1);
+	else
+		git_inflate_end(&stream);
+
 	munmap(map, mapsize);
 	if (status && oi->typep)
 		*oi->typep = status;
+	if (oi->sizep == &size_scratch)
+		oi->sizep = NULL;
 	strbuf_release(&hdrbuf);
 	return (status < 0) ? status : 0;
 }
@@ -2985,6 +2996,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 			hashclr(oi->delta_base_sha1);
 		if (oi->typename)
 			strbuf_addstr(oi->typename, typename(co->type));
+		if (oi->contentp)
+			*oi->contentp = xmemdupz(co->buf, co->size);
 		oi->whence = OI_CACHED;
 		return 0;
 	}
@@ -3078,28 +3091,15 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 static void *read_object(const unsigned char *sha1, enum object_type *type,
 			 unsigned long *size)
 {
-	unsigned long mapsize;
-	void *map, *buf;
-	struct cached_object *co;
-
-	co = find_cached_object(sha1);
-	if (co) {
-		*type = co->type;
-		*size = co->size;
-		return xmemdupz(co->buf, co->size);
-	}
+	struct object_info oi = OBJECT_INFO_INIT;
+	void *content;
+	oi.typep = type;
+	oi.sizep = size;
+	oi.contentp = &content;
 
-	buf = read_packed_sha1(sha1, type, size);
-	if (buf)
-		return buf;
-	map = map_sha1_file(sha1, &mapsize);
-	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
-		munmap(map, mapsize);
-		return buf;
-	}
-	reprepare_packed_git();
-	return read_packed_sha1(sha1, type, size);
+	if (sha1_object_info_extended(sha1, &oi, 0) < 0)
+		return NULL;
+	return content;
 }
 
 /*
-- 
2.13.1.611.g7e3b11ae1-goog
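
The consolidation in patch 5/8 (an optional "contentp" output on the
request struct, with read_object() reduced to a thin wrapper) can be
sketched as below. This is a simplified illustration, not git's
implementation: the in-memory table sketch_store and all sketch_* names
are hypothetical, and string ids stand in for object hashes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request struct: every output is optional. */
struct sketch_info {
	int *typep;
	unsigned long *sizep;
	void **contentp; /* receives a malloc'd copy the caller frees */
};

struct sketch_object {
	const char *id;
	int type;
	const char *data;
};

/* Stand-in for the object store. */
static const struct sketch_object sketch_store[] = {
	{ "a1", 3, "hello" },
	{ "b2", 2, "tree-data" },
};

/* The single lookup function: fills in only what was requested. */
static int sketch_object_info(const char *id, struct sketch_info *oi)
{
	size_t i;

	assert(id != NULL && oi != NULL);
	for (i = 0; i < sizeof(sketch_store) / sizeof(sketch_store[0]); i++) {
		const struct sketch_object *o = &sketch_store[i];
		unsigned long size;

		if (strcmp(o->id, id) != 0)
			continue;
		size = strlen(o->data);
		if (oi->typep)
			*oi->typep = o->type;
		if (oi->sizep)
			*oi->sizep = size;
		if (oi->contentp) {
			char *copy = malloc(size + 1);
			if (!copy)
				return -1;
			memcpy(copy, o->data, size + 1);
			*oi->contentp = copy;
		}
		return 0;
	}
	return -1; /* not found */
}

/* The reader becomes a wrapper that simply requests the content. */
static void *sketch_read_object(const char *id, int *type,
				unsigned long *size)
{
	struct sketch_info oi = { NULL, NULL, NULL };
	void *content = NULL;

	oi.typep = type;
	oi.sizep = size;
	oi.contentp = &content;
	if (sketch_object_info(id, &oi) < 0)
		return NULL;
	return content;
}
```

With this shape, any retry or fallback policy has to be written only
once, inside the lookup function, and the reader inherits it.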



* [PATCH v5 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (28 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 5/8] sha1_file: refactor read_object Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Improve sha1_object_info_extended() by supporting additional flags. This
allows has_sha1_file_with_flags() to be modified to use
sha1_object_info_extended() in a subsequent patch.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache.h     |  4 ++++
 sha1_file.c | 43 ++++++++++++++++++++++++-------------------
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/cache.h b/cache.h
index 48aea923b..7cf2ca466 100644
--- a/cache.h
+++ b/cache.h
@@ -1863,6 +1863,10 @@ struct object_info {
 #define OBJECT_INFO_LOOKUP_REPLACE 1
 /* Allow reading from a loose object file of unknown/bogus type */
 #define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
+/* Do not check cached storage */
+#define OBJECT_INFO_SKIP_CACHED 4
+/* Do not retry packed storage after checking packed and loose storage */
+#define OBJECT_INFO_QUICK 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/sha1_file.c b/sha1_file.c
index 615a27dac..b6bc02f09 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,29 +2977,30 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
-	struct cached_object *co;
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
 
-	co = find_cached_object(real);
-	if (co) {
-		if (oi->typep)
-			*(oi->typep) = co->type;
-		if (oi->sizep)
-			*(oi->sizep) = co->size;
-		if (oi->disk_sizep)
-			*(oi->disk_sizep) = 0;
-		if (oi->delta_base_sha1)
-			hashclr(oi->delta_base_sha1);
-		if (oi->typename)
-			strbuf_addstr(oi->typename, typename(co->type));
-		if (oi->contentp)
-			*oi->contentp = xmemdupz(co->buf, co->size);
-		oi->whence = OI_CACHED;
-		return 0;
+	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
+		struct cached_object *co = find_cached_object(real);
+		if (co) {
+			if (oi->typep)
+				*(oi->typep) = co->type;
+			if (oi->sizep)
+				*(oi->sizep) = co->size;
+			if (oi->disk_sizep)
+				*(oi->disk_sizep) = 0;
+			if (oi->delta_base_sha1)
+				hashclr(oi->delta_base_sha1);
+			if (oi->typename)
+				strbuf_addstr(oi->typename, typename(co->type));
+			if (oi->contentp)
+				*oi->contentp = xmemdupz(co->buf, co->size);
+			oi->whence = OI_CACHED;
+			return 0;
+		}
 	}
 
 	if (!find_pack_entry(real, &e)) {
@@ -3010,9 +3011,13 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 
 		/* Not a loose object; someone else may have just packed it. */
-		reprepare_packed_git();
-		if (!find_pack_entry(real, &e))
+		if (flags & OBJECT_INFO_QUICK) {
 			return -1;
+		} else {
+			reprepare_packed_git();
+			if (!find_pack_entry(real, &e))
+				return -1;
+		}
 	}
 
 	rtype = packed_object_info(e.p, e.offset, oi);
-- 
2.13.1.611.g7e3b11ae1-goog
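
The effect of the two flags added in patch 6/8 on the lookup order can be
sketched as below. The order (cached, then packed, then loose, then a
rescan of the packs) follows the diff above; the struct that simulates the
stores and all SKETCH_* and sketch_* names are hypothetical, with only the
flag values 4 and 8 taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SKIP_CACHED 4u
#define SKETCH_QUICK       8u

enum sketch_whence {
	SKETCH_MISSING = -1,
	SKETCH_CACHED,
	SKETCH_LOOSE,
	SKETCH_PACKED
};

/* Simulated state of the stores for one object. */
struct sketch_stores {
	int in_cache;
	int in_loose;
	int in_pack;              /* visible in the packs known so far */
	int in_pack_after_rescan; /* visible only after rescanning packs */
	int rescans;              /* how often the packs were rescanned */
};

static enum sketch_whence sketch_lookup(struct sketch_stores *s,
					unsigned flags)
{
	assert(s != NULL);
	if (!(flags & SKETCH_SKIP_CACHED) && s->in_cache)
		return SKETCH_CACHED;
	if (s->in_pack)
		return SKETCH_PACKED;
	if (s->in_loose)
		return SKETCH_LOOSE;
	if (flags & SKETCH_QUICK)
		return SKETCH_MISSING; /* skip the costly rescan */
	s->rescans++;
	return s->in_pack_after_rescan ? SKETCH_PACKED : SKETCH_MISSING;
}
```

The trade-off for the quick variant is that an object packed by another
process in the meantime may be reported as missing.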



* [PATCH v5 7/8] sha1_file: do not access pack if unneeded
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (29 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-06-22  0:40 ` [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
  31 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Currently, regardless of the contents of the "struct object_info" passed
to sha1_object_info_extended(), that function always accesses the
packfile whenever it returns information about a packed object, since it
needs to populate "u.packed".

Add the ability to pass NULL, and use NULL-ness of the argument to
activate an optimization in which sha1_object_info_extended() does not
needlessly access the packfile. A subsequent patch will make use of this
optimization.

A similar optimization is not made for the cached and loose cases as it
would not cause a significant performance improvement.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1_file.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index b6bc02f09..bf6b64ec8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2977,12 +2977,16 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
+	static struct object_info blank_oi = OBJECT_INFO_INIT;
 	struct pack_entry e;
 	int rtype;
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
 
+	if (!oi)
+		oi = &blank_oi;
+
 	if (!(flags & OBJECT_INFO_SKIP_CACHED)) {
 		struct cached_object *co = find_cached_object(real);
 		if (co) {
@@ -3020,6 +3024,13 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
+	if (oi == &blank_oi)
+		/*
+		 * We know that the caller doesn't actually need the
+		 * information below, so return early.
+		 */
+		return 0;
+
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
-- 
2.13.1.611.g7e3b11ae1-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags
  2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
                   ` (30 preceding siblings ...)
  2017-06-22  0:40 ` [PATCH v5 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
@ 2017-06-22  0:40 ` Jonathan Tan
  2017-07-18 10:30   ` Christian Couder
  31 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-22  0:40 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

has_sha1_file_with_flags() implements many mechanisms in common with
sha1_object_info_extended(). Make has_sha1_file_with_flags() a
convenience function for sha1_object_info_extended() instead.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fetch.c      | 10 ++++++----
 builtin/index-pack.c |  3 ++-
 cache.h              | 11 +++--------
 sha1_file.c          | 12 ++----------
 4 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 47708451b..96d5146c4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -242,9 +242,11 @@ static void find_non_local_tags(struct transport *transport,
 		 */
 		if (ends_with(ref->name, "^{}")) {
 			if (item &&
-			    !has_object_file_with_flags(&ref->old_oid, HAS_SHA1_QUICK) &&
+			    !has_object_file_with_flags(&ref->old_oid,
+							OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, ref->old_oid.hash) &&
-			    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+			    !has_sha1_file_with_flags(item->util,
+						      OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, item->util))
 				item->util = NULL;
 			item = NULL;
@@ -258,7 +260,7 @@ static void find_non_local_tags(struct transport *transport,
 		 * fetch.
 		 */
 		if (item &&
-		    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+		    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
 		    !will_fetch(head, item->util))
 			item->util = NULL;
 
@@ -279,7 +281,7 @@ static void find_non_local_tags(struct transport *transport,
 	 * checked to see if it needs fetching.
 	 */
 	if (item &&
-	    !has_sha1_file_with_flags(item->util, HAS_SHA1_QUICK) &&
+	    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
 	    !will_fetch(head, item->util))
 		item->util = NULL;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 04b9dcaf0..587bc80c9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -794,7 +794,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed = has_sha1_file_with_flags(oid->hash, HAS_SHA1_QUICK);
+		collision_test_needed =
+			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
 		read_unlock();
 	}
 
diff --git a/cache.h b/cache.h
index 7cf2ca466..3ae9769aa 100644
--- a/cache.h
+++ b/cache.h
@@ -1268,15 +1268,10 @@ int read_loose_object(const char *path,
 		      void **contents);
 
 /*
- * Return true iff we have an object named sha1, whether local or in
- * an alternate object database, and whether packed or loose.  This
- * function does not respect replace references.
- *
- * If the QUICK flag is set, do not re-check the pack directory
- * when we cannot find the object (this means we may give a false
- * negative answer if another process is simultaneously repacking).
+ * Convenience for sha1_object_info_extended() with a NULL struct
+ * object_info. OBJECT_INFO_SKIP_CACHED is automatically set; pass
+ * nonzero flags to also set other flags.
  */
-#define HAS_SHA1_QUICK 0x1
 extern int has_sha1_file_with_flags(const unsigned char *sha1, int flags);
 static inline int has_sha1_file(const unsigned char *sha1)
 {
diff --git a/sha1_file.c b/sha1_file.c
index bf6b64ec8..778f01d92 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
 
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
-	struct pack_entry e;
-
 	if (!startup_info->have_repository)
 		return 0;
-	if (find_pack_entry(sha1, &e))
-		return 1;
-	if (has_loose_object(sha1))
-		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	return sha1_object_info_extended(sha1, NULL,
+					 flags | OBJECT_INFO_SKIP_CACHED) >= 0;
 }
 
 int has_object_file(const struct object_id *oid)
-- 
2.13.1.611.g7e3b11ae1-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 0/8] Improvements to sha1_file
  2017-06-22  0:40 ` [PATCH v5 0/8] Improvements to sha1_file Jonathan Tan
@ 2017-06-22  1:40   ` Junio C Hamano
  0 siblings, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-22  1:40 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> After some more thought, I think I came up with a better solution -
> allow sha1_object_info_extended() to take a NULL struct object_info
> pointer,...

That is an excellent way to tell the machinery "I care about NO
details of the object" ;-)




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-20  1:03 ` [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
@ 2017-06-24 12:45   ` Jeff King
  2017-06-26 16:45     ` Jonathan Tan
  2017-06-26 17:26     ` Junio C Hamano
  0 siblings, 2 replies; 70+ messages in thread
From: Jeff King @ 2017-06-24 12:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:

> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
> Improve sha1_object_info_extended() by supporting additional flags. This
> allows has_sha1_file_with_flags() to be modified to use
> sha1_object_info_extended() in a subsequent patch.

A minor nit, but try to avoid vague words like "improve" in your subject
lines. Something like:

  sha1_file: teach sha1_object_info_extended more flags

That's not too specific either, but I think in --oneline output it gives
you a much better clue about what part of the function it touches.

> ---
>  cache.h     |  4 ++++
>  sha1_file.c | 43 ++++++++++++++++++++++++-------------------
>  2 files changed, 28 insertions(+), 19 deletions(-)

The patch itself looks good.

-Peff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-21 18:15   ` Junio C Hamano
@ 2017-06-24 12:48     ` Jeff King
  2017-06-24 18:41       ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff King @ 2017-06-24 12:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Wed, Jun 21, 2017 at 11:15:01AM -0700, Junio C Hamano wrote:

> > +	if (!oi->typep && !oi->sizep && !oi->disk_sizep &&
> > +	    !oi->delta_base_sha1 && !oi->typename && !oi->contentp &&
> > +	    !oi->populate_u) {
> > +		oi->whence = OI_PACKED;
> > +		return 0;
> > +	}
> > +
> 
> ... this "if" statement feels like a maintenance nightmare.  The
> intent of the guard, I think, is "when the call wants absolutely
> nothing but whence", but the implementation of the guard will not
> stay true to the intent whenever somebody adds a new field to oi.
> 
> I wonder if it makes more sense to have a new field "whence_only",
> which is set only by such a specialized caller, which this guard
> checks (and no other fields).

The other nice thing about whence_only is that it flips the logic. So
any existing callers which depend on filling the union automatically
will not be affected (though I would be surprised if there are any such
callers; most of that information isn't actually that interesting).

-Peff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 0/8] Improvements to sha1_file
  2017-06-20  1:03 ` [PATCH v4 0/8] Improvements to sha1_file Jonathan Tan
  2017-06-21 18:18   ` Junio C Hamano
@ 2017-06-24 12:51   ` Jeff King
  1 sibling, 0 replies; 70+ messages in thread
From: Jeff King @ 2017-06-24 12:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Mon, Jun 19, 2017 at 06:03:07PM -0700, Jonathan Tan wrote:

> > I had the same thoughts (both on the name and the "vocabularies"). IMHO
> > we should consider allocating the bits from the same set. There's only
> > one HAS_SHA1 flag, and it has an exact match in OBJECT_INFO_QUICK.
> 
> Agreed - in this patch set, I have also consolidated the relevant flags,
> including LOOKUP_REPLACE_OBJECT and LOOKUP_UNKNOWN_OBJECT.
> 
> In addition, Junio has mentioned the potential confusion in behavior
> between a NULL and an empty struct passed to
> sha1_object_info_extended(). In this patch set, I require non-NULL, and
> have added an optimization that avoids accessing the pack in certain
> situations, but this optimization requires checking a lot of fields. Let
> me know what you think.

Yes, I like that direction (and the direction of the whole series) much
better. Thanks for working on it.

I'm trying to clear my "to be reviewed" backlog before going offline for
a week, so I gave it a fairly cursory review. I had only a few minor
comments, but I agree with the points that Junio already raised.

-Peff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-24 12:48     ` Jeff King
@ 2017-06-24 18:41       ` Junio C Hamano
  2017-06-24 20:39         ` Jeff King
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-06-24 18:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> On Wed, Jun 21, 2017 at 11:15:01AM -0700, Junio C Hamano wrote:
>
>> > +	if (!oi->typep && !oi->sizep && !oi->disk_sizep &&
>> > +	    !oi->delta_base_sha1 && !oi->typename && !oi->contentp &&
>> > +	    !oi->populate_u) {
>> > +		oi->whence = OI_PACKED;
>> > +		return 0;
>> > +	}
>> > +
>> 
>> ... this "if" statement feels like a maintenance nightmare.  The
>> intent of the guard, I think, is "when the call wants absolutely
>> nothing but whence", but the implementation of the guard will not
>> stay true to the intent whenever somebody adds a new field to oi.
>> 
>> I wonder if it makes more sense to have a new field "whence_only",
>> which is set only by such a specialized caller, which this guard
>> checks (and no other fields).
>
> The other nice thing about whence_only is that it flips the logic. So
> any existing callers which depend on filling the union automatically
> will not be affected (though I would be surprised if there are any such
> callers; most of that information isn't actually that interesting).

Hmph, but the solution does not scale.  When a caller wants whence
and something else that cannot be asked for or ignored by being a
"pointer to a result" field, such a request cannot be expressed.  We
either need to make every field in the oi request a "pointer to a
result if the result is needed, or NULL when the result is not of
interest", or give a bit for each non-pointer field to allow the
caller to express "I am not interested in the value of this field".

In the use case Jonathan has, the caller's wish is a very narrow "I
am interested in nothing; just checking if the object is there", and
passing NULL for oi works fine.  So I'm inclined to suggest that we
take that approach now and defer worrying about a more generic and
scalable "how would one tell the machinery that the value for a field
is uninteresting when the field is not a pointer to result?" mechanism
until a real need arises.

Thanks.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-24 18:41       ` Junio C Hamano
@ 2017-06-24 20:39         ` Jeff King
  2017-06-26 16:28           ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff King @ 2017-06-24 20:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Sat, Jun 24, 2017 at 11:41:39AM -0700, Junio C Hamano wrote:

> > The other nice thing about whence_only is that it flips the logic. So
> > any existing callers which depend on filling the union automatically
> > will not be affected (though I would be surprised if there are any such
> > callers; most of that information isn't actually that interesting).
> 
> Hmph, but the solution does not scale.  When a caller wants whence
> and something else that cannot be asked for or ignored by being a
> "pointer to a result" field, such a request cannot be expressed.  We
> either need to make every field in the oi request a "pointer to a
> result if the result is needed, or NULL when the result is not of
> interest", or give a bit for each non-pointer field to allow the
> caller to express "I am not interested in the value of this field".

True.

> In the use case Jonathan has, the caller's wish is a very narrow "I
> am interested in nothing; just checking if the object is there", and
> passing NULL for oi works fine.  So I'm inclined to suggest that we
> take that approach now and defer worrying about a more generic and
> scalable "how would one tell the machinery that the value for a field
> is uninteresting when the field is not a pointer to result?" mechanism
> until a real need arises.

If we are open to writing anything, then I think it should follow the
same pointer-to-data pattern that the rest of the struct does. I.e.,
declare the extra pack-data struct as a pointer, and let callers fill it
in or not. It's excessive in the sense that it's not a variable-sized
answer, but it at least makes the interface consistent.

And no callers who read it would be silently broken; the actual data
type in "struct object_info" would change, so they'd get a noisy compile
error.
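
For anyone following along, the pointer-to-data pattern discussed above
can be sketched roughly as below. This is only an illustration under
stated assumptions: every name in it (toy_object_info, toy_pack_info,
packp, toy_pack_reads) is hypothetical and none of it is the actual
"struct object_info" from cache.h. The counter stands in for the
expensive pack access.

```c
#include <stddef.h>

/* Hypothetical stand-in for the pack-specific part of the answer. */
struct toy_pack_info {
	unsigned long offset;
	int is_delta;
};

enum toy_whence { TOY_OI_LOOSE = 1, TOY_OI_PACKED = 2 };

/* Hypothetical, heavily simplified request/response struct. */
struct toy_object_info {
	/* Request: non-NULL means "fill this in", NULL means "not interested". */
	unsigned long *sizep;
	struct toy_pack_info *packp;

	/* Response: always set by the callee. */
	enum toy_whence whence;
};

/* Counts how often the (pretend) expensive pack access happened. */
static int toy_pack_reads;

static int toy_object_info(struct toy_object_info *oi)
{
	/* Pretend the object was found in a pack. */
	oi->whence = TOY_OI_PACKED;

	/* Nothing was requested, so the pack itself need not be read. */
	if (!oi->sizep && !oi->packp)
		return 0;

	toy_pack_reads++;
	if (oi->sizep)
		*oi->sizep = 42;	/* made-up value for the sketch */
	if (oi->packp) {
		oi->packp->offset = 12;	/* made-up value for the sketch */
		oi->packp->is_delta = 0;
	}
	return 0;
}
```

Note that the early-return guard still has to name every request field,
so it grows whenever a field is added; the pattern makes the interface
consistent but does not remove that check.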

-Peff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 7/8] sha1_file: do not access pack if unneeded
  2017-06-24 20:39         ` Jeff King
@ 2017-06-26 16:28           ` Jonathan Tan
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-26 16:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git mailing list

On Sat, Jun 24, 2017 at 1:39 PM, Jeff King <peff@peff.net> wrote:
> On Sat, Jun 24, 2017 at 11:41:39AM -0700, Junio C Hamano wrote:
> If we are open to writing anything, then I think it should follow the
> same pointer-to-data pattern that the rest of the struct does. I.e.,
> declare the extra pack-data struct as a pointer, and let callers fill it
> in or not. It's excessive in the sense that it's not a variable-sized
> answer, but it at least makes the interface consistent.
>
> And no callers who read it would be silently broken; the actual data
> type in "struct object_info" would change, so they'd get a noisy compile
> error.

I considered that, but there was some trickiness in streaming.c -
open_istream() would need to establish that pointer even though that
is not its responsibility, or istream_source would need to
heap-allocate some memory then point to it from `oi` (it has to be
heap-allocated since the memory must outlive the function).

Also, it does not solve the "maintenance nightmare" issue that Junio
described [1] (in that in order to optimize the pack read away, we would
need a big "if" statement).

Those issues are probably surmountable, but in the end I settled on
just allowing the caller to pass NULL and having
sha1_object_info_extended() substitute an empty struct when that
happens, as you can see in v5 [2]. That allows most of
sha1_object_info_extended() to not have to handle NULL, but also
allows for the specific optimization (optimizing the pack read away)
that I want.
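
A minimal sketch of that NULL-substitution idea, for illustration only.
The names here (sketch_object_info, found_in_pack, sketch_pack_reads)
are hypothetical and the struct is far smaller than the real one; the
only parts taken from the v5 patch are the static blank struct and the
pointer comparison against it.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for "struct object_info". */
struct sketch_object_info {
	unsigned long *sizep;	/* request: NULL means "not interested" */
	int found_in_pack;	/* response */
};

/* Counts how often the (pretend) expensive pack access happened. */
static int sketch_pack_reads;

static int sketch_object_info(struct sketch_object_info *oi)
{
	/* Static storage, so it starts out zeroed and outlives the call. */
	static struct sketch_object_info blank_oi;

	if (!oi)
		oi = &blank_oi;

	/*
	 * From here on the code may dereference oi freely; it never
	 * has to test for NULL again.
	 */
	oi->found_in_pack = 1;

	/* The caller passed NULL, so no details are wanted. */
	if (oi == &blank_oi)
		return 0;

	sketch_pack_reads++;
	if (oi->sizep)
		*oi->sizep = 42;	/* made-up value for the sketch */
	return 0;
}
```

In this sketch only a NULL argument skips the pack read. A caller that
passes its own all-NULL struct still pays for it, which keeps the guard
down to a single pointer comparison.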

[1] https://public-inbox.org/git/xmqq8tklqkbe.fsf@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/ddbbc86204c131c83b3a1ff3b52789be9ed2a5b1.1498091579.git.jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-24 12:45   ` Jeff King
@ 2017-06-26 16:45     ` Jonathan Tan
  2017-06-26 17:28       ` Junio C Hamano
  2017-06-26 17:26     ` Junio C Hamano
  1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-06-26 16:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Git mailing list, Junio C Hamano

On Sat, Jun 24, 2017 at 5:45 AM, Jeff King <peff@peff.net> wrote:
> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>
>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>> Improve sha1_object_info_extended() by supporting additional flags. This
>> allows has_sha1_file_with_flags() to be modified to use
>> sha1_object_info_extended() in a subsequent patch.
>
> A minor nit, but try to avoid vague words like "improve" in your subject
> lines. Something like:
>
>   sha1_file: teach sha1_object_info_extended more flags
>
> That's not too specific either, but I think in --oneline output it gives
> you a much better clue about what part of the function it touches.
>
>> ---
>>  cache.h     |  4 ++++
>>  sha1_file.c | 43 ++++++++++++++++++++++++-------------------
>>  2 files changed, 28 insertions(+), 19 deletions(-)
>
> The patch itself looks good.

Thanks. I did try, but all my attempts exceeded 50 characters. Maybe
"sha1_file: support more flags in object info query" is good enough.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-24 12:45   ` Jeff King
  2017-06-26 16:45     ` Jonathan Tan
@ 2017-06-26 17:26     ` Junio C Hamano
  1 sibling, 0 replies; 70+ messages in thread
From: Junio C Hamano @ 2017-06-26 17:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>
>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>> Improve sha1_object_info_extended() by supporting additional flags. This
>> allows has_sha1_file_with_flags() to be modified to use
>> sha1_object_info_extended() in a subsequent patch.
>
> A minor nit, but try to avoid vague words like "improve" in your subject
> lines. Something like:
>
>   sha1_file: teach sha1_object_info_extended more flags
>
> That's not too specific either, but I think in --oneline output it gives
> you a much better clue about what part of the function it touches.

Yeah, thanks for paying attention to the --oneline output.  Mentioning
the exact function name tells the reader that the change is about more
options for information gathering, which makes it a better title.

>> ---
>>  cache.h     |  4 ++++
>>  sha1_file.c | 43 ++++++++++++++++++++++++-------------------
>>  2 files changed, 28 insertions(+), 19 deletions(-)
>
> The patch itself looks good.

Yeah, I agree.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-26 16:45     ` Jonathan Tan
@ 2017-06-26 17:28       ` Junio C Hamano
  2017-06-26 17:35         ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-06-26 17:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jeff King, Git mailing list

Jonathan Tan <jonathantanmy@google.com> writes:

> On Sat, Jun 24, 2017 at 5:45 AM, Jeff King <peff@peff.net> wrote:
>> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>>
>>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>>> Improve sha1_object_info_extended() by supporting additional flags. This
>>> allows has_sha1_file_with_flags() to be modified to use
>>> sha1_object_info_extended() in a subsequent patch.
>>
>> A minor nit, but try to avoid vague words like "improve" in your subject
>> lines. Something like:
>>
>>   sha1_file: teach sha1_object_info_extended more flags
>>
>> That's not too specific either, but I think in --oneline output it gives
>> you a much better clue about what part of the function it touches.
>>
>>> ---
>>>  cache.h     |  4 ++++
>>>  sha1_file.c | 43 ++++++++++++++++++++++++-------------------
>>>  2 files changed, 28 insertions(+), 19 deletions(-)
>>
>> The patch itself looks good.
>
> Thanks. I did try, but all my attempts exceeded 50 characters. Maybe
> "sha1_file: support more flags in object info query" is good enough.

Between the two, I personally find that Peff's is more descriptive,
so unless there are other changes planned, let me "rebase -i" to
retitle the commit.

Thanks.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
  2017-06-26 17:28       ` Junio C Hamano
@ 2017-06-26 17:35         ` Jonathan Tan
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Tan @ 2017-06-26 17:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Git mailing list

On Mon, Jun 26, 2017 at 10:28 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> On Sat, Jun 24, 2017 at 5:45 AM, Jeff King <peff@peff.net> wrote:
>>> On Mon, Jun 19, 2017 at 06:03:13PM -0700, Jonathan Tan wrote:
>>>
>>>> Subject: [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended
>>>> Improve sha1_object_info_extended() by supporting additional flags. This
>>>> allows has_sha1_file_with_flags() to be modified to use
>>>> sha1_object_info_extended() in a subsequent patch.
>>>
>>> A minor nit, but try to avoid vague words like "improve" in your subject
>>> lines. Something like:
>>>
>>>   sha1_file: teach sha1_object_info_extended more flags
>>>
>>> That's not too specific either, but I think in --oneline output it gives
>>> you a much better clue about what part of the function it touches.
>>>
>>>> ---
>>>>  cache.h     |  4 ++++
>>>>  sha1_file.c | 43 ++++++++++++++++++++++++-------------------
>>>>  2 files changed, 28 insertions(+), 19 deletions(-)
>>>
>>> The patch itself looks good.
>>
>> Thanks. I did try, but all my attempts exceeded 50 characters. Maybe
>> "sha1_file: support more flags in object info query" is good enough.
>
> Between the two, I personally find that Peff's is more descriptive,
> so unless there are other changes planned, let me "rebase -i" to
> retitle the commit.
>
> Thanks.

His suggestion does exceed 50 characters, but I see that that's a soft
limit. Either title is fine with me, thanks.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags
  2017-06-22  0:40 ` [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
@ 2017-07-18 10:30   ` Christian Couder
  2017-07-18 16:39     ` Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Christian Couder @ 2017-07-18 10:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano

On Thu, Jun 22, 2017 at 2:40 AM, Jonathan Tan <jonathantanmy@google.com> wrote:

> diff --git a/sha1_file.c b/sha1_file.c
> index bf6b64ec8..778f01d92 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
>
>  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
>  {
> -       struct pack_entry e;
> -
>         if (!startup_info->have_repository)
>                 return 0;
> -       if (find_pack_entry(sha1, &e))
> -               return 1;
> -       if (has_loose_object(sha1))
> -               return 1;
> -       if (flags & HAS_SHA1_QUICK)
> -               return 0;
> -       reprepare_packed_git();
> -       return find_pack_entry(sha1, &e);
> +       return sha1_object_info_extended(sha1, NULL,
> +                                        flags | OBJECT_INFO_SKIP_CACHED) >= 0;
>  }

I am not sure whether this affects performance much (one way or the
other), but I wanted to note that has_loose_object() calls
check_and_freshen(), which calls access() on loose object files, while
sha1_object_info_extended() calls sha1_loose_object_info(), which calls
stat_sha1_file(), which calls lstat() on loose object files.

So depending on the relative performance of access() and lstat() there
could be a performance impact on repos that have a lot of loose object
files.
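
To make the two code paths concrete, here is a rough sketch of the two
kinds of existence check, using only the POSIX calls themselves. The
helper names (exists_via_access, exists_via_lstat) are made up for this
example and are not the git functions named above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Existence check via access(): only asks whether the path can be found. */
static int exists_via_access(const char *path)
{
	return access(path, F_OK) == 0;
}

/*
 * Existence check via lstat(): also fetches metadata, so the on-disk
 * size comes along with the answer. size_out may be NULL.
 */
static int exists_via_lstat(const char *path, off_t *size_out)
{
	struct stat st;

	if (lstat(path, &st) < 0)
		return 0;
	if (size_out)
		*size_out = st.st_size;
	return 1;
}
```

The two are not strictly interchangeable: access() follows symbolic
links while lstat() does not, so they can disagree on a dangling
symlink. Which one is cheaper depends on the platform and would have to
be measured.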

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags
  2017-07-18 10:30   ` Christian Couder
@ 2017-07-18 16:39     ` Jonathan Tan
  2017-07-19 12:52       ` Johannes Schindelin
  0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-07-18 16:39 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano

On Tue, 18 Jul 2017 12:30:46 +0200
Christian Couder <christian.couder@gmail.com> wrote:

> On Thu, Jun 22, 2017 at 2:40 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> 
> > diff --git a/sha1_file.c b/sha1_file.c
> > index bf6b64ec8..778f01d92 100644
> > --- a/sha1_file.c
> > +++ b/sha1_file.c
> > @@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
> >
> >  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
> >  {
> > -       struct pack_entry e;
> > -
> >         if (!startup_info->have_repository)
> >                 return 0;
> > -       if (find_pack_entry(sha1, &e))
> > -               return 1;
> > -       if (has_loose_object(sha1))
> > -               return 1;
> > -       if (flags & HAS_SHA1_QUICK)
> > -               return 0;
> > -       reprepare_packed_git();
> > -       return find_pack_entry(sha1, &e);
> > +       return sha1_object_info_extended(sha1, NULL,
> > +                                        flags | OBJECT_INFO_SKIP_CACHED) >= 0;
> >  }
> 
> I am not sure if it could affect performance (in one way or another) a
> lot or not but I just wanted to note that has_loose_object() calls
> check_and_freshen() which calls access() on loose object files, while
> sha1_object_info_extended() calls sha1_loose_object_info() which calls
> stat_sha1_file() which calls lstat() on loose object files.
> 
> So depending on the relative performance of access() and lstat() there
> could be a performance impact on repos that have a lot of loose object
> files.

That is true, but from what little I have read online, they have about
the same performance.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags
  2017-07-18 16:39     ` Jonathan Tan
@ 2017-07-19 12:52       ` Johannes Schindelin
  2017-07-19 17:12         ` [PATCH] sha1_file: use access(), not lstat(), if possible Jonathan Tan
  0 siblings, 1 reply; 70+ messages in thread
From: Johannes Schindelin @ 2017-07-19 12:52 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Christian Couder, git, Junio C Hamano

Hi Jonathan,

On Tue, 18 Jul 2017, Jonathan Tan wrote:

> On Tue, 18 Jul 2017 12:30:46 +0200
> Christian Couder <christian.couder@gmail.com> wrote:
> 
> > On Thu, Jun 22, 2017 at 2:40 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> > 
> > > diff --git a/sha1_file.c b/sha1_file.c
> > > index bf6b64ec8..778f01d92 100644
> > > --- a/sha1_file.c
> > > +++ b/sha1_file.c
> > > @@ -3494,18 +3494,10 @@ int has_sha1_pack(const unsigned char *sha1)
> > >
> > >  int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
> > >  {
> > > -       struct pack_entry e;
> > > -
> > >         if (!startup_info->have_repository)
> > >                 return 0;
> > > -       if (find_pack_entry(sha1, &e))
> > > -               return 1;
> > > -       if (has_loose_object(sha1))
> > > -               return 1;
> > > -       if (flags & HAS_SHA1_QUICK)
> > > -               return 0;
> > > -       reprepare_packed_git();
> > > -       return find_pack_entry(sha1, &e);
> > > +       return sha1_object_info_extended(sha1, NULL,
> > > +                                        flags | OBJECT_INFO_SKIP_CACHED) >= 0;
> > >  }
> > 
> > I am not sure if it could affect performance (in one way or another) a
> > lot or not but I just wanted to note that has_loose_object() calls
> > check_and_freshen() which calls access() on loose object files, while
> > sha1_object_info_extended() calls sha1_loose_object_info() which calls
> > stat_sha1_file() which calls lstat() on loose object files.
> > 
> > So depending on the relative performance of access() and lstat() there
> > could be a performance impact on repos that have a lot of loose object
> > files.
> 
> That is true, but from what little I have read online, they have about
> the same performance.

Then your online sources missed out on what we have in compat/mingw.[ch].
I would expect _waccess() (which is used to emulate access()) to be
substantially faster than the hoops we jump through to emulate lstat().

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH] sha1_file: use access(), not lstat(), if possible
  2017-07-19 12:52       ` Johannes Schindelin
@ 2017-07-19 17:12         ` Jonathan Tan
  2017-07-20 21:48           ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Tan @ 2017-07-19 17:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Johannes.Schindelin

In sha1_loose_object_info(), use access() (indirectly invoked through
has_loose_object()) instead of lstat() if we do not need the on-disk
size, as it should be faster on Windows [1].

[1] https://public-inbox.org/git/alpine.DEB.2.21.1.1707191450570.4193@virtualbox/

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Thanks for the information - here's a patch. Do you, by any chance, know
of a web page (or similar thing) that I can cite for this?
---
 sha1_file.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index fca165f13..81962b019 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2920,20 +2920,19 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 
 	/*
 	 * If we don't care about type or size, then we don't
-	 * need to look inside the object at all. Note that we
-	 * do not optimize out the stat call, even if the
-	 * caller doesn't care about the disk-size, since our
-	 * return value implicitly indicates whether the
-	 * object even exists.
+	 * need to look inside the object at all. We only check
+	 * for its existence.
 	 */
 	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
-		const char *path;
-		struct stat st;
-		if (stat_sha1_file(sha1, &st, &path) < 0)
-			return -1;
-		if (oi->disk_sizep)
+		if (oi->disk_sizep) {
+			const char *path;
+			struct stat st;
+			if (stat_sha1_file(sha1, &st, &path) < 0)
+				return -1;
 			*oi->disk_sizep = st.st_size;
-		return 0;
+			return 0;
+		}
+		return has_loose_object(sha1) ? 0 : -1;
 	}
 
 	map = map_sha1_file(sha1, &mapsize);
-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH] sha1_file: use access(), not lstat(), if possible
  2017-07-19 17:12         ` [PATCH] sha1_file: use access(), not lstat(), if possible Jonathan Tan
@ 2017-07-20 21:48           ` Junio C Hamano
  2017-07-22 11:16             ` Johannes Schindelin
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-07-20 21:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Johannes.Schindelin

Jonathan Tan <jonathantanmy@google.com> writes:

> In sha1_loose_object_info(), use access() (indirectly invoked through
> has_loose_object()) instead of lstat() if we do not need the on-disk
> size, as it should be faster on Windows [1].

That sounds as if Windows is the only thing that matters.  "It is
faster in general, and is much faster on Windows" would have been
more convincing, and "It isn't slower, and is much faster on
Windows" would also have been OK.  Do we have any measurements, or
does this patch not yield any measurable gain?

By the way, the special casing of disk_sizep (which is only used by
the batch-check feature of cat-file) is somewhat annoying with or
without this patch, but this change makes it even more so by adding
an extra indentation level.  I cannot think of a way to make it less
annoying offhand, and I do not think this change needs to address it
in any way, but I am mentioning this as a hint to bystanders who may
want to find something small that can be cleaned up ;-)
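
For bystanders, one possible shape for such a cleanup is sketched
below. It is only an illustration under stated assumptions:
loose_exists_or_size() and struct loose_query are hypothetical
stand-ins that take a plain path, not code from sha1_file.c. The
real branch would keep calling has_loose_object() and
stat_sha1_file() on a sha1. Moving the branch into a helper with
early returns removes the extra indentation level.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant part of struct object_info. */
struct loose_query {
	off_t *disk_sizep;	/* NULL when the caller does not need the size */
};

/*
 * Hypothetical helper: answer "does it exist?" cheaply, and only
 * call lstat() when the on-disk size was requested.
 * Returns 0 if the path exists, -1 otherwise.
 */
int loose_exists_or_size(const char *path, const struct loose_query *q)
{
	struct stat st;

	/* Existence only: access() is enough, no stat data needed. */
	if (!q->disk_sizep)
		return access(path, F_OK) ? -1 : 0;

	/* The size was requested, so we do need the stat data. */
	if (lstat(path, &st) < 0)
		return -1;
	*q->disk_sizep = st.st_size;
	return 0;
}
```

The caller's special case would then shrink to a single
"return loose_exists_or_size(...)" line.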

Thanks.

>
> [1] https://public-inbox.org/git/alpine.DEB.2.21.1.1707191450570.4193@virtualbox/
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Thanks for the information - here's a patch. Do you, by any chance, know
> of a web page (or similar thing) that I can cite for this?
> ---
>  sha1_file.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/sha1_file.c b/sha1_file.c
> index fca165f13..81962b019 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2920,20 +2920,19 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>  
>  	/*
>  	 * If we don't care about type or size, then we don't
> -	 * need to look inside the object at all. Note that we
> -	 * do not optimize out the stat call, even if the
> -	 * caller doesn't care about the disk-size, since our
> -	 * return value implicitly indicates whether the
> -	 * object even exists.
> +	 * need to look inside the object at all. We only check
> +	 * for its existence.
>  	 */
>  	if (!oi->typep && !oi->typename && !oi->sizep && !oi->contentp) {
> -		const char *path;
> -		struct stat st;
> -		if (stat_sha1_file(sha1, &st, &path) < 0)
> -			return -1;
> -		if (oi->disk_sizep)
> +		if (oi->disk_sizep) {
> +			const char *path;
> +			struct stat st;
> +			if (stat_sha1_file(sha1, &st, &path) < 0)
> +				return -1;
>  			*oi->disk_sizep = st.st_size;
> -		return 0;
> +			return 0;
> +		}
> +		return has_loose_object(sha1) ? 0 : -1;
>  	}
>  
>  	map = map_sha1_file(sha1, &mapsize);


* Re: [PATCH] sha1_file: use access(), not lstat(), if possible
  2017-07-20 21:48           ` Junio C Hamano
@ 2017-07-22 11:16             ` Johannes Schindelin
  2017-07-22 16:15               ` Junio C Hamano
  0 siblings, 1 reply; 70+ messages in thread
From: Johannes Schindelin @ 2017-07-22 11:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

Hi,

On Thu, 20 Jul 2017, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > In sha1_loose_object_info(), use access() (indirectly invoked through
> > has_loose_object()) instead of lstat() if we do not need the on-disk
> > size, as it should be faster on Windows [1].
> 
> That sounds as if Windows is the only thing that matters.  "It is
> faster in general, and is much faster on Windows" would have been
> more convincing, and "It isn't slower, and is much faster on
> Windows" would also have been OK.  Do we have any measurement, or
> does this patch not yield any measurable gain?
> 
> By the way, the special casing of disk_sizep (which is only used by
> the batch-check feature of cat-file) is somewhat annoying with or
> without this patch, but this change makes it even more so by adding
> an extra indentation level.  I cannot think of a way to make it less
> annoying offhand, and I do not think this change needs to address it
> in any way, but I am mentioning this as a hint to bystanders who may
> want to find something small that can be cleaned up ;-)

I actually found a separate piece of information in the meantime:

https://blogs.msdn.microsoft.com/oldnewthing/20071023-00/?p=24713#comment-562083

i.e. _waccess() is implemented in the same way our mingw_lstat()
implementation is: by calling the very same GetFileAttributes() code path.
So it has exactly the same performance characteristics, and I was wrong.

But this whole thread taps into a gripe I have with parts of Git's code
base: part of the code is not clear at all in its intent by virtue of
calling whatever POSIX function may seem to give the answer for the
intended question, instead of implementing a function whose name says
precisely what question is asked.

In this instance, we do not call a helper get_file_size(). Oh no. That
would make it too obvious. We call lstat() instead -- under the assumption
that the whole world runs on Linux, really, because let's be honest about
it: lstat() implementations all differ in subtle ways and we really only
test on Linux.

The obviousness of something like get_file_size() would be so refreshing
to these tired eyes.

Oh, and it would make it much easier to maintain ports to other Operating
Systems, most notably Windows.
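
A minimal sketch of what such a helper could look like follows.
get_file_size() does not exist in Git; the name comes from the
paragraphs above, while the signature and return convention are
assumptions made for illustration. The POSIX version simply wraps
lstat(); a port could supply its own body without touching callers.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not an existing Git function: store the size
 * of the file at "path" in *size.
 * Returns 0 on success, -1 on error (errno is set by lstat()).
 */
int get_file_size(const char *path, off_t *size)
{
	struct stat st;

	if (lstat(path, &st) < 0)
		return -1;
	*size = st.st_size;
	return 0;
}
```

A call site then reads "get_file_size(path, &size)", which states
the question being asked instead of leaving the reader to infer it
from which struct stat field is used afterwards.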

Ciao,
Dscho


* Re: [PATCH] sha1_file: use access(), not lstat(), if possible
  2017-07-22 11:16             ` Johannes Schindelin
@ 2017-07-22 16:15               ` Junio C Hamano
  2017-07-25 10:19                 ` Johannes Schindelin
  0 siblings, 1 reply; 70+ messages in thread
From: Junio C Hamano @ 2017-07-22 16:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jonathan Tan, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> But this whole thread taps into a gripe I have with parts of Git's code
> base: part of the code is not clear at all in its intent by virtue of
> calling whatever POSIX function may seem to give the answer for the
> intended question, instead of implementing a function whose name says
> precisely what question is asked.
>
> In this instance, we do not call a helper get_file_size(). Oh no. That
> would make it too obvious. We call lstat() instead.

I agree with you for this case and a case like this in general.  

In codepaths at a much lower level (they tend to be the ancient and
quite fundamental ones) in our codebase, lstat() is often used
directly because the caller is interested not in a single aspect
of a path but in many fields of struct stat.

When the code is interested in existence or size or whatever single
aspect of a path and nothing else, however, the code would become
easier to read if a helper function with a more specific name is
used.  And it may even help individual platforms that do not want to
use the full lstat() emulation, by telling them that other fields in
struct stat are not needed.

Of course, then the issue becomes what to do when we are interested
in not just one but a selected few attributes.  Perhaps we create a
helper "get_A_B_and_C_attributes_for_path()", which may use lstat()
on POSIX and the most efficient way to get only A, B and C attributes
on non-POSIX platforms.  The implementation would be OK, but the naming
becomes a bit hard; we need to give it a good name.

Things get even more interesting when the set of attributes we are
interested in grows by one and we need to rename the function to
"get_A_B_C_and_D_attributes_for_path()".  When it is a lot easier to
fall back to the full lstat() emulation on non-POSIX platforms, the
temptation to just use it even though it would grab attributes that
are not needed in that function grows, which needs to be resisted by
those who are doing the actual implementation for a particular platform.


* Re: [PATCH] sha1_file: use access(), not lstat(), if possible
  2017-07-22 16:15               ` Junio C Hamano
@ 2017-07-25 10:19                 ` Johannes Schindelin
  0 siblings, 0 replies; 70+ messages in thread
From: Johannes Schindelin @ 2017-07-25 10:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

Hi,

On Sat, 22 Jul 2017, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > But this whole thread taps into a gripe I have with parts of Git's code
> > base: part of the code is not clear at all in its intent by virtue of
> > calling whatever POSIX function may seem to give the answer for the
> > intended question, instead of implementing a function whose name says
> > precisely what question is asked.
> >
> > In this instance, we do not call a helper get_file_size(). Oh no. That
> > would make it too obvious. We call lstat() instead.
> 
> I agree with you for this case and a case like this in general.  
> 
> In codepaths at a much lower level (they tend to be the ancient and
> quite fundamental ones) in our codebase, lstat() is often used
> directly because the caller is interested not in a single aspect
> of a path but in many fields of struct stat.
> 
> When the code is interested in existence or size or whatever single
> aspect of a path and nothing else, however, the code would become
> easier to read if a helper function with a more specific name is
> used.  And it may even help individual platforms that do not want to
> use the full lstat() emulation, by telling them that other fields in
> struct stat are not needed.
> 
> Of course, then the issue becomes what to do when we are interested
> in not just one but a selected few attributes.  Perhaps we create a
> helper "get_A_B_and_C_attributes_for_path()", which may use lstat()
> on POSIX and the most efficient way to get only A, B and C attributes
> on non-POSIX platforms.  The implementation would be OK, but the naming
> becomes a bit hard; we need to give it a good name.
> 
> Things get even more interesting when the set of attributes we are
> interested in grows by one and we need to rename the function to
> "get_A_B_C_and_D_attributes_for_path()".  When it is a lot easier to
> fall back to the full lstat() emulation on non-POSIX platforms, the
> temptation to just use it even though it would grab attributes that
> are not needed in that function grows, which needs to be resisted by
> those who are doing the actual implementation for a particular platform.

It becomes a lot easier to fall back to lstat(), if a lot less readable,
yes.

Until, that is, one realises that the function name does not have to
encode what information is sought. It can be a bit field in a parameter
instead. There are even precedents in Git's own source code for that
rather smart paradigm.
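
As a rough sketch of that idea, assuming made-up names throughout:
get_path_info(), struct path_info and the PATH_INFO_* flags below
are hypothetical, not existing Git API. The caller passes a bit
field naming the attributes it wants, so the function name never
has to change when the set grows.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical flags: each bit names one attribute the caller wants. */
#define PATH_INFO_SIZE  (1u << 0)
#define PATH_INFO_MTIME (1u << 1)
#define PATH_INFO_MODE  (1u << 2)

/* Hypothetical result struct; only requested fields are written. */
struct path_info {
	off_t size;
	time_t mtime;
	mode_t mode;
};

/*
 * Fill in only the fields selected by "wanted"; passing 0 is a pure
 * existence check. Returns 0 on success, -1 on error.
 *
 * This POSIX fallback always calls lstat(). A platform-specific
 * version could inspect "wanted" and use a cheaper call instead.
 */
int get_path_info(const char *path, unsigned wanted, struct path_info *info)
{
	struct stat st;

	if (lstat(path, &st) < 0)
		return -1;
	if (wanted & PATH_INFO_SIZE)
		info->size = st.st_size;
	if (wanted & PATH_INFO_MTIME)
		info->mtime = st.st_mtime;
	if (wanted & PATH_INFO_MODE)
		info->mode = st.st_mode;
	return 0;
}
```

Adding a fourth attribute means adding one flag and one field; the
existing callers stay as they are.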

Ciao,
Dscho


Thread overview: 70+ messages
2017-06-09 19:23 [RFC PATCH 0/4] Improvements to sha1_file Jonathan Tan
2017-06-09 19:23 ` [RFC PATCH 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
2017-06-12 20:55   ` Junio C Hamano
2017-06-09 19:23 ` [RFC PATCH 2/4] sha1_file: extract type and size from object_info Jonathan Tan
2017-06-10  7:01   ` Jeff King
2017-06-12 19:52     ` Jonathan Tan
2017-06-12 21:13       ` Jeff King
2017-06-09 19:23 ` [RFC PATCH 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
2017-06-09 19:23 ` [RFC PATCH 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
2017-06-13 21:05 ` [PATCH v2 0/4] Improvements to sha1_file Jonathan Tan
2017-06-13 21:05 ` [PATCH v2 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
2017-06-13 21:05 ` [PATCH v2 2/4] sha1_file: move delta base cache code up Jonathan Tan
2017-06-15 17:00   ` Junio C Hamano
2017-06-13 21:05 ` [PATCH v2 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
2017-06-15 17:50   ` Junio C Hamano
2017-06-15 18:14     ` Jonathan Tan
2017-06-17 12:19     ` Jeff King
2017-06-19  4:18       ` Junio C Hamano
2017-06-13 21:06 ` [PATCH v2 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
2017-06-15 18:34   ` Junio C Hamano
2017-06-15 20:31     ` Jonathan Tan
2017-06-15 20:52       ` Junio C Hamano
2017-06-15 20:39 ` [PATCH v3 0/4] Improvements to sha1_file Jonathan Tan
2017-06-15 20:39 ` [PATCH v3 1/4] sha1_file: teach packed_object_info about typename Jonathan Tan
2017-06-15 20:39 ` [PATCH v3 2/4] sha1_file: move delta base cache code up Jonathan Tan
2017-06-15 20:39 ` [PATCH v3 3/4] sha1_file: consolidate storage-agnostic object fns Jonathan Tan
2017-06-15 20:39 ` [PATCH v3 4/4] sha1_file, fsck: add missing blob support Jonathan Tan
2017-06-20  1:03 ` [PATCH v4 0/8] Improvements to sha1_file Jonathan Tan
2017-06-21 18:18   ` Junio C Hamano
2017-06-24 12:51   ` Jeff King
2017-06-20  1:03 ` [PATCH v4 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
2017-06-20  1:03 ` [PATCH v4 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
2017-06-21 17:22   ` Junio C Hamano
2017-06-21 17:34     ` Jonathan Tan
2017-06-20  1:03 ` [PATCH v4 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
2017-06-21 17:33   ` Junio C Hamano
2017-06-20  1:03 ` [PATCH v4 4/8] sha1_file: move delta base cache code up Jonathan Tan
2017-06-20  1:03 ` [PATCH v4 5/8] sha1_file: refactor read_object Jonathan Tan
2017-06-21 17:58   ` Junio C Hamano
2017-06-20  1:03 ` [PATCH v4 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
2017-06-24 12:45   ` Jeff King
2017-06-26 16:45     ` Jonathan Tan
2017-06-26 17:28       ` Junio C Hamano
2017-06-26 17:35         ` Jonathan Tan
2017-06-26 17:26     ` Junio C Hamano
2017-06-20  1:03 ` [PATCH v4 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
2017-06-21 18:15   ` Junio C Hamano
2017-06-24 12:48     ` Jeff King
2017-06-24 18:41       ` Junio C Hamano
2017-06-24 20:39         ` Jeff King
2017-06-26 16:28           ` Jonathan Tan
2017-06-20  1:03 ` [PATCH v4 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 0/8] Improvements to sha1_file Jonathan Tan
2017-06-22  1:40   ` Junio C Hamano
2017-06-22  0:40 ` [PATCH v5 1/8] sha1_file: teach packed_object_info about typename Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 2/8] sha1_file: rename LOOKUP_UNKNOWN_OBJECT Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 3/8] sha1_file: rename LOOKUP_REPLACE_OBJECT Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 4/8] sha1_file: move delta base cache code up Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 5/8] sha1_file: refactor read_object Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 6/8] sha1_file: improve sha1_object_info_extended Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 7/8] sha1_file: do not access pack if unneeded Jonathan Tan
2017-06-22  0:40 ` [PATCH v5 8/8] sha1_file: refactor has_sha1_file_with_flags Jonathan Tan
2017-07-18 10:30   ` Christian Couder
2017-07-18 16:39     ` Jonathan Tan
2017-07-19 12:52       ` Johannes Schindelin
2017-07-19 17:12         ` [PATCH] sha1_file: use access(), not lstat(), if possible Jonathan Tan
2017-07-20 21:48           ` Junio C Hamano
2017-07-22 11:16             ` Johannes Schindelin
2017-07-22 16:15               ` Junio C Hamano
2017-07-25 10:19                 ` Johannes Schindelin
