git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/2] Refactor hash search with fanout table
@ 2018-02-02 22:36 Jonathan Tan
  2018-02-02 22:36 ` [PATCH 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-02 22:36 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, stolee

After reviewing Derrick's Serialized Git Commit Graph patches [1], I
noticed that "[PATCH v2 11/14] commit: integrate commit graph with
commit parsing" contains (in bsearch_graph) a repeat of some packfile
functionality. Here is a pack that refactors that functionality out.

Derrick, consider incorporating these patches in your next reroll.

[1] https://public-inbox.org/git/1517348383-112294-1-git-send-email-dstolee@microsoft.com/

Jonathan Tan (2):
  packfile: remove GIT_DEBUG_LOOKUP log statements
  packfile: refactor hash search with fanout table

 packfile.c    | 30 +++++-------------------------
 sha1-lookup.c | 24 ++++++++++++++++++++++++
 sha1-lookup.h | 21 +++++++++++++++++++++
 3 files changed, 50 insertions(+), 25 deletions(-)

-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements
  2018-02-02 22:36 [PATCH 0/2] Refactor hash search with fanout table Jonathan Tan
@ 2018-02-02 22:36 ` Jonathan Tan
  2018-02-02 22:36 ` [PATCH 2/2] packfile: refactor hash search with fanout table Jonathan Tan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-02 22:36 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, stolee

In commit 628522ec1439 ("sha1-lookup: more memory efficient search in
sorted list of SHA-1", 2008-04-09), a different algorithm for searching
a sorted list was introduced, together with a set of log statements
guarded by GIT_DEBUG_LOOKUP that are invoked both when using that
algorithm and when using the existing binary search. Those log
statements was meant for experiments and debugging, but with the removal
of the aforementioned different algorithm in commit f1068efefe6d
("sha1_file: drop experimental GIT_USE_LOOKUP search", 2017-08-09),
those log statements are probably no longer necessary.

Remove those statements.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 packfile.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/packfile.c b/packfile.c
index 4a5fe7ab1..58bdced3b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1713,10 +1713,6 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
 	unsigned hi, lo, stride;
-	static int debug_lookup = -1;
-
-	if (debug_lookup < 0)
-		debug_lookup = !!getenv("GIT_DEBUG_LOOKUP");
 
 	if (!index) {
 		if (open_pack_index(p))
@@ -1738,17 +1734,10 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 4;
 	}
 
-	if (debug_lookup)
-		printf("%02x%02x%02x... lo %u hi %u nr %"PRIu32"\n",
-		       sha1[0], sha1[1], sha1[2], lo, hi, p->num_objects);
-
 	while (lo < hi) {
 		unsigned mi = lo + (hi - lo) / 2;
 		int cmp = hashcmp(index + mi * stride, sha1);
 
-		if (debug_lookup)
-			printf("lo %u hi %u rg %u mi %u\n",
-			       lo, hi, hi - lo, mi);
 		if (!cmp)
 			return nth_packed_object_offset(p, mi);
 		if (cmp > 0)
-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] packfile: refactor hash search with fanout table
  2018-02-02 22:36 [PATCH 0/2] Refactor hash search with fanout table Jonathan Tan
  2018-02-02 22:36 ` [PATCH 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
@ 2018-02-02 22:36 ` Jonathan Tan
  2018-02-09 18:03   ` René Scharfe
  2018-02-02 23:30 ` [PATCH 0/2] Refactor " Junio C Hamano
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
  3 siblings, 1 reply; 13+ messages in thread
From: Jonathan Tan @ 2018-02-02 22:36 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, stolee

Subsequent patches will introduce file formats that make use of a fanout
array and a sorted table containing hashes, just like packfiles.
Refactor the hash search in packfile.c into its own function, so that
those patches can make use of it as well.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 packfile.c    | 19 +++++--------------
 sha1-lookup.c | 24 ++++++++++++++++++++++++
 sha1-lookup.h | 21 +++++++++++++++++++++
 3 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/packfile.c b/packfile.c
index 58bdced3b..29f5dc239 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1712,7 +1712,8 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 {
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
-	unsigned hi, lo, stride;
+	unsigned stride;
+	int ret;
 
 	if (!index) {
 		if (open_pack_index(p))
@@ -1725,8 +1726,6 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 8;
 	}
 	index += 4 * 256;
-	hi = ntohl(level1_ofs[*sha1]);
-	lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
 	if (p->index_version > 1) {
 		stride = 20;
 	} else {
@@ -1734,17 +1733,9 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 4;
 	}
 
-	while (lo < hi) {
-		unsigned mi = lo + (hi - lo) / 2;
-		int cmp = hashcmp(index + mi * stride, sha1);
-
-		if (!cmp)
-			return nth_packed_object_offset(p, mi);
-		if (cmp > 0)
-			hi = mi;
-		else
-			lo = mi+1;
-	}
+	ret = bsearch_hash(sha1, level1_ofs, index, stride);
+	if (ret >= 0)
+		return nth_packed_object_offset(p, ret);
 	return 0;
 }
 
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 4cf3ebd92..d11c5e526 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -99,3 +99,27 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t nr,
 	} while (lo < hi);
 	return -lo-1;
 }
+
+int bsearch_hash(const unsigned char *sha1, const void *fanout_,
+		 const void *table_, size_t stride)
+{
+	const uint32_t *fanout = fanout_;
+	const unsigned char *table = table_;
+	int hi, lo;
+
+	hi = ntohl(fanout[*sha1]);
+	lo = ((*sha1 == 0x0) ? 0 : ntohl(fanout[*sha1 - 1]));
+
+	while (lo < hi) {
+		unsigned mi = lo + (hi - lo) / 2;
+		int cmp = hashcmp(table + mi * stride, sha1);
+
+		if (!cmp)
+			return mi;
+		if (cmp > 0)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
+	return -lo - 1;
+}
diff --git a/sha1-lookup.h b/sha1-lookup.h
index cf5314f40..3c59e9cb1 100644
--- a/sha1-lookup.h
+++ b/sha1-lookup.h
@@ -7,4 +7,25 @@ extern int sha1_pos(const unsigned char *sha1,
 		    void *table,
 		    size_t nr,
 		    sha1_access_fn fn);
+
+/*
+ * Searches for sha1 in table, using the given fanout table to determine the
+ * interval to search, then using binary search. Returns the element index of
+ * the position found if successful, -i-1 if not (where i is the index of the
+ * least element that is greater than sha1).
+ *
+ * Takes the following parameters:
+ *
+ *  - sha1: the hash to search for
+ *  - fanout: a 256-element array of NETWORK-order 32-bit integers; the integer
+ *    at position i represents the number of elements in table whose first byte
+ *    is less than or equal to i
+ *  - table: a sorted list of hashes with optional extra information in between
+ *  - stride: distance between two consecutive elements in table (should be
+ *    GIT_MAX_RAWSZ or greater)
+ *
+ * This function does not verify the validity of the fanout table.
+ */
+extern int bsearch_hash(const unsigned char *sha1, const void *fanout,
+			const void *table, size_t stride);
 #endif
-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Refactor hash search with fanout table
  2018-02-02 22:36 [PATCH 0/2] Refactor hash search with fanout table Jonathan Tan
  2018-02-02 22:36 ` [PATCH 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
  2018-02-02 22:36 ` [PATCH 2/2] packfile: refactor hash search with fanout table Jonathan Tan
@ 2018-02-02 23:30 ` Junio C Hamano
  2018-02-03  2:09   ` Derrick Stolee
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2018-02-02 23:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> After reviewing Derrick's Serialized Git Commit Graph patches [1], I
> noticed that "[PATCH v2 11/14] commit: integrate commit graph with
> commit parsing" contains (in bsearch_graph) a repeat of some packfile
> functionality. Here is a pack that refactors that functionality out.

Yay.  I had exactly the same reaction to that part of the series.

Thansk.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Refactor hash search with fanout table
  2018-02-02 23:30 ` [PATCH 0/2] Refactor " Junio C Hamano
@ 2018-02-03  2:09   ` Derrick Stolee
  0 siblings, 0 replies; 13+ messages in thread
From: Derrick Stolee @ 2018-02-03  2:09 UTC (permalink / raw)
  To: Junio C Hamano, Jonathan Tan; +Cc: git

On 2/2/2018 6:30 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>> After reviewing Derrick's Serialized Git Commit Graph patches [1], I
>> noticed that "[PATCH v2 11/14] commit: integrate commit graph with
>> commit parsing" contains (in bsearch_graph) a repeat of some packfile
>> functionality. Here is a pack that refactors that functionality out.
> 
> Yay.  I had exactly the same reaction to that part of the series.
> 

Thanks for doing this refactor. I'm a fan of reducing code clones, but 
also don't want to break well-worn code paths.

Jonathan: While you are doing this, I'm guessing you could use your new 
method to replace (and maybe speed up) the binary search in 
sha1_name.c:find_abbrev_len_for_pack(). Otherwise, I can take a stab at 
it next week.

Please add
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] packfile: refactor hash search with fanout table
  2018-02-02 22:36 ` [PATCH 2/2] packfile: refactor hash search with fanout table Jonathan Tan
@ 2018-02-09 18:03   ` René Scharfe
  2018-02-09 19:50     ` Jonathan Tan
  0 siblings, 1 reply; 13+ messages in thread
From: René Scharfe @ 2018-02-09 18:03 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: stolee

Am 02.02.2018 um 23:36 schrieb Jonathan Tan:
> Subsequent patches will introduce file formats that make use of a fanout
> array and a sorted table containing hashes, just like packfiles.
> Refactor the hash search in packfile.c into its own function, so that
> those patches can make use of it as well.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>   packfile.c    | 19 +++++--------------
>   sha1-lookup.c | 24 ++++++++++++++++++++++++
>   sha1-lookup.h | 21 +++++++++++++++++++++
>   3 files changed, 50 insertions(+), 14 deletions(-)
> 
> diff --git a/packfile.c b/packfile.c
> index 58bdced3b..29f5dc239 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1712,7 +1712,8 @@ off_t find_pack_entry_one(const unsigned char *sha1,
>   {
>   	const uint32_t *level1_ofs = p->index_data;
>   	const unsigned char *index = p->index_data;
> -	unsigned hi, lo, stride;
> +	unsigned stride;
> +	int ret;
>   
>   	if (!index) {
>   		if (open_pack_index(p))
> @@ -1725,8 +1726,6 @@ off_t find_pack_entry_one(const unsigned char *sha1,
>   		index += 8;
>   	}
>   	index += 4 * 256;
> -	hi = ntohl(level1_ofs[*sha1]);
> -	lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
>   	if (p->index_version > 1) {
>   		stride = 20;
>   	} else {
> @@ -1734,17 +1733,9 @@ off_t find_pack_entry_one(const unsigned char *sha1,
>   		index += 4;
>   	}
>   
> -	while (lo < hi) {
> -		unsigned mi = lo + (hi - lo) / 2;
> -		int cmp = hashcmp(index + mi * stride, sha1);
> -
> -		if (!cmp)
> -			return nth_packed_object_offset(p, mi);
> -		if (cmp > 0)
> -			hi = mi;
> -		else
> -			lo = mi+1;
> -	}
> +	ret = bsearch_hash(sha1, level1_ofs, index, stride);
> +	if (ret >= 0)
> +		return nth_packed_object_offset(p, ret);

Going from unsigned to signed int means the patch breaks support for
more than 2G pack entries, which was put with 326bf39677 (Use uint32_t
for all packed object counts.) in 2007.

>   	return 0;
>   }
>   
> diff --git a/sha1-lookup.c b/sha1-lookup.c
> index 4cf3ebd92..d11c5e526 100644
> --- a/sha1-lookup.c
> +++ b/sha1-lookup.c
> @@ -99,3 +99,27 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t nr,
>   	} while (lo < hi);
>   	return -lo-1;
>   }
> +
> +int bsearch_hash(const unsigned char *sha1, const void *fanout_,
> +		 const void *table_, size_t stride)
> +{
> +	const uint32_t *fanout = fanout_;

Why hide the type?  It doesn't make the function more generic.

> +	const unsigned char *table = table_;
> +	int hi, lo;
> +
> +	hi = ntohl(fanout[*sha1]);
> +	lo = ((*sha1 == 0x0) ? 0 : ntohl(fanout[*sha1 - 1]));
> +
> +	while (lo < hi) {
> +		unsigned mi = lo + (hi - lo) / 2;
> +		int cmp = hashcmp(table + mi * stride, sha1);
> +
> +		if (!cmp)
> +			return mi;
> +		if (cmp > 0)
> +			hi = mi;
> +		else
> +			lo = mi + 1;
> +	}
> +	return -lo - 1;
> +}
> diff --git a/sha1-lookup.h b/sha1-lookup.h
> index cf5314f40..3c59e9cb1 100644
> --- a/sha1-lookup.h
> +++ b/sha1-lookup.h
> @@ -7,4 +7,25 @@ extern int sha1_pos(const unsigned char *sha1,
>   		    void *table,
>   		    size_t nr,
>   		    sha1_access_fn fn);
> +
> +/*
> + * Searches for sha1 in table, using the given fanout table to determine the
> + * interval to search, then using binary search. Returns the element index of
> + * the position found if successful, -i-1 if not (where i is the index of the
> + * least element that is greater than sha1).
> + *
> + * Takes the following parameters:
> + *
> + *  - sha1: the hash to search for
> + *  - fanout: a 256-element array of NETWORK-order 32-bit integers; the integer
> + *    at position i represents the number of elements in table whose first byte
> + *    is less than or equal to i
> + *  - table: a sorted list of hashes with optional extra information in between
> + *  - stride: distance between two consecutive elements in table (should be
> + *    GIT_MAX_RAWSZ or greater)
> + *
> + * This function does not verify the validity of the fanout table.
> + */
> +extern int bsearch_hash(const unsigned char *sha1, const void *fanout,
> +			const void *table, size_t stride);
>   #endif
> 

Why not use sha1_pos()?  I guess because it avoids the overhead of the
accessor function, right?  And I wonder how much of difference it makes.

A binary search function for embedded hashes just needs the key, a
pointer to the first hash in the array, the stride and the number of
elements.  It can then be used with or without a fanout table, making it
more versatile.  Just a thought.

René

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] packfile: refactor hash search with fanout table
  2018-02-09 18:03   ` René Scharfe
@ 2018-02-09 19:50     ` Jonathan Tan
  0 siblings, 0 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-09 19:50 UTC (permalink / raw)
  To: René Scharfe; +Cc: git, stolee

On Fri, 9 Feb 2018 19:03:48 +0100
René Scharfe <l.s.r@web.de> wrote:

> Going from unsigned to signed int means the patch breaks support for
> more than 2G pack entries, which was put with 326bf39677 (Use uint32_t
> for all packed object counts.) in 2007.

Ah, good catch. I'll wait to see if there are any more comments, then
send out a new version.

> > +int bsearch_hash(const unsigned char *sha1, const void *fanout_,
> > +		 const void *table_, size_t stride)
> > +{
> > +	const uint32_t *fanout = fanout_;
> 
> Why hide the type?  It doesn't make the function more generic.

I thought that the fanout_ parameter could come from a variety of
sources (e.g. direct mmap - void *, or mmap with some pointer arithmetic
- char *) so I just picked the generic one. But now I realize that that
could lead to unaligned reads, which is probably not a good idea. I'll
update it.

For consistency, I'll also update table_ to be unsigned char *.
(Unsigned because it is primarily interpreted as hashes, which use
"unsigned char *" in the Git code.)

> Why not use sha1_pos()?  I guess because it avoids the overhead of the
> accessor function, right?  And I wonder how much of difference it makes.

Yes, overhead of the accessor function. We would also need to modify
sha1_pos to take in a function that we can pass userdata to (to contain
the stride).

> A binary search function for embedded hashes just needs the key, a
> pointer to the first hash in the array, the stride and the number of
> elements.  It can then be used with or without a fanout table, making it
> more versatile.  Just a thought.

I specifically want to include the fanout table in the calculation here,
because it will be used by subsequent patches that also incorporate the
fanout table.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 0/2] Refactor hash search with fanout table
  2018-02-02 22:36 [PATCH 0/2] Refactor hash search with fanout table Jonathan Tan
                   ` (2 preceding siblings ...)
  2018-02-02 23:30 ` [PATCH 0/2] Refactor " Junio C Hamano
@ 2018-02-13 18:39 ` Jonathan Tan
  2018-02-13 18:39   ` [PATCH v2 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
                     ` (3 more replies)
  3 siblings, 4 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-13 18:39 UTC (permalink / raw)
  To: git; +Cc: l.s.r, stolee, Jonathan Tan

Updates from v1:
 - use uint32_t so that we can operate on packfiles of up to 4G objects
   (this also means that I had to change the signature of the function)
 - don't hide types

Derrick: you'll need to slightly change your patch to use the new API.
As for find_abbrev_len_for_pack(), that's a good idea - I didn't do it
in this set but it definitely should be done.

Jonathan Tan (2):
  packfile: remove GIT_DEBUG_LOOKUP log statements
  packfile: refactor hash search with fanout table

 packfile.c    | 29 ++++-------------------------
 sha1-lookup.c | 28 ++++++++++++++++++++++++++++
 sha1-lookup.h | 22 ++++++++++++++++++++++
 3 files changed, 54 insertions(+), 25 deletions(-)

-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
@ 2018-02-13 18:39   ` Jonathan Tan
  2018-02-13 18:39   ` [PATCH v2 2/2] packfile: refactor hash search with fanout table Jonathan Tan
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-13 18:39 UTC (permalink / raw)
  To: git; +Cc: l.s.r, stolee, Jonathan Tan

In commit 628522ec1439 ("sha1-lookup: more memory efficient search in
sorted list of SHA-1", 2008-04-09), a different algorithm for searching
a sorted list was introduced, together with a set of log statements
guarded by GIT_DEBUG_LOOKUP that are invoked both when using that
algorithm and when using the existing binary search. Those log
statements was meant for experiments and debugging, but with the removal
of the aforementioned different algorithm in commit f1068efefe6d
("sha1_file: drop experimental GIT_USE_LOOKUP search", 2017-08-09),
those log statements are probably no longer necessary.

Remove those statements.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 packfile.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/packfile.c b/packfile.c
index 4a5fe7ab1..58bdced3b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1713,10 +1713,6 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
 	unsigned hi, lo, stride;
-	static int debug_lookup = -1;
-
-	if (debug_lookup < 0)
-		debug_lookup = !!getenv("GIT_DEBUG_LOOKUP");
 
 	if (!index) {
 		if (open_pack_index(p))
@@ -1738,17 +1734,10 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 4;
 	}
 
-	if (debug_lookup)
-		printf("%02x%02x%02x... lo %u hi %u nr %"PRIu32"\n",
-		       sha1[0], sha1[1], sha1[2], lo, hi, p->num_objects);
-
 	while (lo < hi) {
 		unsigned mi = lo + (hi - lo) / 2;
 		int cmp = hashcmp(index + mi * stride, sha1);
 
-		if (debug_lookup)
-			printf("lo %u hi %u rg %u mi %u\n",
-			       lo, hi, hi - lo, mi);
 		if (!cmp)
 			return nth_packed_object_offset(p, mi);
 		if (cmp > 0)
-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] packfile: refactor hash search with fanout table
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
  2018-02-13 18:39   ` [PATCH v2 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
@ 2018-02-13 18:39   ` Jonathan Tan
  2018-02-13 18:52   ` [PATCH v2 0/2] Refactor " Derrick Stolee
  2018-02-13 19:57   ` Junio C Hamano
  3 siblings, 0 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-13 18:39 UTC (permalink / raw)
  To: git; +Cc: l.s.r, stolee, Jonathan Tan

Subsequent patches will introduce file formats that make use of a fanout
array and a sorted table containing hashes, just like packfiles.
Refactor the hash search in packfile.c into its own function, so that
those patches can make use of it as well.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 packfile.c    | 18 ++++--------------
 sha1-lookup.c | 28 ++++++++++++++++++++++++++++
 sha1-lookup.h | 22 ++++++++++++++++++++++
 3 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/packfile.c b/packfile.c
index 58bdced3b..59648a182 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1712,7 +1712,8 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 {
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
-	unsigned hi, lo, stride;
+	unsigned stride;
+	uint32_t result;
 
 	if (!index) {
 		if (open_pack_index(p))
@@ -1725,8 +1726,6 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 8;
 	}
 	index += 4 * 256;
-	hi = ntohl(level1_ofs[*sha1]);
-	lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
 	if (p->index_version > 1) {
 		stride = 20;
 	} else {
@@ -1734,17 +1733,8 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 		index += 4;
 	}
 
-	while (lo < hi) {
-		unsigned mi = lo + (hi - lo) / 2;
-		int cmp = hashcmp(index + mi * stride, sha1);
-
-		if (!cmp)
-			return nth_packed_object_offset(p, mi);
-		if (cmp > 0)
-			hi = mi;
-		else
-			lo = mi+1;
-	}
+	if (bsearch_hash(sha1, level1_ofs, index, stride, &result))
+		return nth_packed_object_offset(p, result);
 	return 0;
 }
 
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 4cf3ebd92..8d0b1db3e 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -99,3 +99,31 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t nr,
 	} while (lo < hi);
 	return -lo-1;
 }
+
+int bsearch_hash(const unsigned char *sha1, const uint32_t *fanout_nbo,
+		 const unsigned char *table, size_t stride, uint32_t *result)
+{
+	uint32_t hi, lo;
+
+	hi = ntohl(fanout_nbo[*sha1]);
+	lo = ((*sha1 == 0x0) ? 0 : ntohl(fanout_nbo[*sha1 - 1]));
+
+	while (lo < hi) {
+		unsigned mi = lo + (hi - lo) / 2;
+		int cmp = hashcmp(table + mi * stride, sha1);
+
+		if (!cmp) {
+			if (result)
+				*result = mi;
+			return 1;
+		}
+		if (cmp > 0)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
+
+	if (result)
+		*result = lo;
+	return 0;
+}
diff --git a/sha1-lookup.h b/sha1-lookup.h
index cf5314f40..7678b23b3 100644
--- a/sha1-lookup.h
+++ b/sha1-lookup.h
@@ -7,4 +7,26 @@ extern int sha1_pos(const unsigned char *sha1,
 		    void *table,
 		    size_t nr,
 		    sha1_access_fn fn);
+
+/*
+ * Searches for sha1 in table, using the given fanout table to determine the
+ * interval to search, then using binary search. Returns 1 if found, 0 if not.
+ *
+ * Takes the following parameters:
+ *
+ *  - sha1: the hash to search for
+ *  - fanout_nbo: a 256-element array of NETWORK-order 32-bit integers; the
+ *    integer at position i represents the number of elements in table whose
+ *    first byte is less than or equal to i
+ *  - table: a sorted list of hashes with optional extra information in between
+ *  - stride: distance between two consecutive elements in table (should be
+ *    GIT_MAX_RAWSZ or greater)
+ *  - result: if not NULL, this function stores the element index of the
+ *    position found (if the search is successful) or the index of the least
+ *    element that is greater than sha1 (if the search is not successful)
+ *
+ * This function does not verify the validity of the fanout table.
+ */
+int bsearch_hash(const unsigned char *sha1, const uint32_t *fanout_nbo,
+		 const unsigned char *table, size_t stride, uint32_t *result);
 #endif
-- 
2.16.0.rc1.238.g530d649a79-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] Refactor hash search with fanout table
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
  2018-02-13 18:39   ` [PATCH v2 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
  2018-02-13 18:39   ` [PATCH v2 2/2] packfile: refactor hash search with fanout table Jonathan Tan
@ 2018-02-13 18:52   ` Derrick Stolee
  2018-02-13 19:57   ` Junio C Hamano
  3 siblings, 0 replies; 13+ messages in thread
From: Derrick Stolee @ 2018-02-13 18:52 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: l.s.r

On 2/13/2018 1:39 PM, Jonathan Tan wrote:
> Updates from v1:
>   - use uint32_t so that we can operate on packfiles of up to 4G objects
>     (this also means that I had to change the signature of the function)

I take it "4G objects" means "4 billion objects". The distinction is 
only important that the limit is not a 4GB IDX file, but in fact a large 
enough object count that the IDX file can be 100+ GB.

>   - don't hide types
>
> Derrick: you'll need to slightly change your patch to use the new API.
> As for find_abbrev_len_for_pack(), that's a good idea - I didn't do it
> in this set but it definitely should be done.

Thanks. I'll try to remember to do that change when this lands in 
master. I'll bring the new prototype into my commit-graph patch.

Reviewed-By: Derrick Stolee <dstolee@microsoft.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] Refactor hash search with fanout table
  2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
                     ` (2 preceding siblings ...)
  2018-02-13 18:52   ` [PATCH v2 0/2] Refactor " Derrick Stolee
@ 2018-02-13 19:57   ` Junio C Hamano
  2018-02-13 20:15     ` Jonathan Tan
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2018-02-13 19:57 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, l.s.r, stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> Updates from v1:
>  - use uint32_t so that we can operate on packfiles of up to 4G objects
>    (this also means that I had to change the signature of the function)
>  - don't hide types
>
> Derrick: you'll need to slightly change your patch to use the new API.
> As for find_abbrev_len_for_pack(), that's a good idea - I didn't do it
> in this set but it definitely should be done.
>
> Jonathan Tan (2):
>   packfile: remove GIT_DEBUG_LOOKUP log statements
>   packfile: refactor hash search with fanout table

Hmm, is this meant to replace the topic that was merged to 'next'
last week?

>
>  packfile.c    | 29 ++++-------------------------
>  sha1-lookup.c | 28 ++++++++++++++++++++++++++++
>  sha1-lookup.h | 22 ++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 25 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] Refactor hash search with fanout table
  2018-02-13 19:57   ` Junio C Hamano
@ 2018-02-13 20:15     ` Jonathan Tan
  0 siblings, 0 replies; 13+ messages in thread
From: Jonathan Tan @ 2018-02-13 20:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, l.s.r, stolee

On Tue, 13 Feb 2018 11:57:16 -0800
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Updates from v1:
> >  - use uint32_t so that we can operate on packfiles of up to 4G objects
> >    (this also means that I had to change the signature of the function)
> >  - don't hide types
> >
> > Derrick: you'll need to slightly change your patch to use the new API.
> > As for find_abbrev_len_for_pack(), that's a good idea - I didn't do it
> > in this set but it definitely should be done.
> >
> > Jonathan Tan (2):
> >   packfile: remove GIT_DEBUG_LOOKUP log statements
> >   packfile: refactor hash search with fanout table
> 
> Hmm, is this meant to replace the topic that was merged to 'next'
> last week?

Yes - René pointed out [1] that V1 of my patch series (which you merged
to 'next') does not handle packfiles of more than 2G pack entries, so I
sent out a new version. Yes, this replaces jt/binsearch-with-fanout.
Sorry for not being clearer.

[1] https://public-inbox.org/git/cfbde137-dbac-8796-f49f-2a543303d33a@web.de/

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-02-13 20:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-02 22:36 [PATCH 0/2] Refactor hash search with fanout table Jonathan Tan
2018-02-02 22:36 ` [PATCH 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
2018-02-02 22:36 ` [PATCH 2/2] packfile: refactor hash search with fanout table Jonathan Tan
2018-02-09 18:03   ` René Scharfe
2018-02-09 19:50     ` Jonathan Tan
2018-02-02 23:30 ` [PATCH 0/2] Refactor " Junio C Hamano
2018-02-03  2:09   ` Derrick Stolee
2018-02-13 18:39 ` [PATCH v2 " Jonathan Tan
2018-02-13 18:39   ` [PATCH v2 1/2] packfile: remove GIT_DEBUG_LOOKUP log statements Jonathan Tan
2018-02-13 18:39   ` [PATCH v2 2/2] packfile: refactor hash search with fanout table Jonathan Tan
2018-02-13 18:52   ` [PATCH v2 0/2] Refactor " Derrick Stolee
2018-02-13 19:57   ` Junio C Hamano
2018-02-13 20:15     ` Jonathan Tan

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).