git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] delta-island fixes
@ 2018-11-20  9:44 Jeff King
  2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Jeff King @ 2018-11-20  9:44 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, Derrick Stolee

This fixes a few bugs in the cc/delta-islands topic: one major, and two
minor.

Sadly, the major one was not caught by our test suite, and I'm not sure
how to remedy that. The problem is a memory management one, and happens
only when you have a reasonably large number of objects. So it triggers
readily when run on a real repository, but not on the toy one in t5320.
Creating a much larger repository there would make the test suite more
expensive.

In cases like this I think it's often a good idea to have a perf test.
Those are expensive anyway, and we'd have the double benefit of
exercising the code and showing off the performance improvement. But the
delta-island code only makes sense on a very specialized repo: one where
you have multiple related but diverging histories.  You could simulate
that by picking two branches in a repository, but the effect is going to
be miniscule.

Anyway, here are the fixes without tests. We should at least apply these
before v2.20 ships with the bugs.

  [1/3]: pack-objects: fix tree_depth and layer invariants
  [2/3]: pack-objects: zero-initialize tree_depth/layer arrays
  [3/3]: pack-objects: fix off-by-one in delta-island tree-depth computation

 builtin/pack-objects.c | 4 +++-
 git-compat-util.h      | 1 +
 pack-objects.h         | 4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] pack-objects: fix tree_depth and layer invariants
  2018-11-20  9:44 [PATCH 0/3] delta-island fixes Jeff King
@ 2018-11-20  9:46 ` Jeff King
  2018-11-20 16:37   ` Duy Nguyen
  2018-11-21  4:52   ` Junio C Hamano
  2018-11-20  9:48 ` [PATCH 2/3] pack-objects: zero-initialize tree_depth/layer arrays Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: Jeff King @ 2018-11-20  9:46 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, Derrick Stolee

Commit 108f530385 (pack-objects: move tree_depth into 'struct
packing_data', 2018-08-16) dynamically manages a tree_depth array in
packing_data that maintains one of these invariants:

  1. tree_depth is NULL (i.e., the requested options don't require us to
     track tree depths)

  2. tree_depth is non-NULL and has as many entries as the "objects"
     array

We maintain (2) by:

  a. When the objects array grows, grow tree_depth to the same size
     (unless it's NULL, in which case we can leave it).

  b. When a caller asks to set a depth via oe_set_tree_depth(), if
     tree_depth is NULL we allocate it.

But in (b), we use the number of stored objects, _not_ the allocated
size of the objects array. So we can run into a situation like this:

  1. packlist_alloc() needs to store the Nth object, so it grows the
     objects array to M, where M > N.

  2. oe_set_tree_depth() wants to store a depth, so it allocates an
     array of length N. Now we've violated our invariant.

  3. packlist_alloc() needs to store the N+1th object. But it _doesn't_
     grow the objects array, since N <= M still holds. We try to assign
     to tree_depth[N+1], which is out of bounds.

That doesn't happen in our test scripts, because the repositories they
use are so small, but it's easy to trigger by running:

  echo HEAD | git pack-objects --revs --delta-islands --stdout >/dev/null

in any reasonably-sized repo (like git.git).

We can fix it by always growing the array to match pack->nr_alloc, not
pack->nr_objects. Likewise for the "layer" array from fe0ac2fb7f
(pack-objects: move 'layer' into 'struct packing_data', 2018-08-16),
which has the same bug.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-objects.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pack-objects.h b/pack-objects.h
index feb6a6a05e..f31ac1c81c 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -412,7 +412,7 @@ static inline void oe_set_tree_depth(struct packing_data *pack,
 				     unsigned int tree_depth)
 {
 	if (!pack->tree_depth)
-		ALLOC_ARRAY(pack->tree_depth, pack->nr_objects);
+		ALLOC_ARRAY(pack->tree_depth, pack->nr_alloc);
 	pack->tree_depth[e - pack->objects] = tree_depth;
 }
 
@@ -429,7 +429,7 @@ static inline void oe_set_layer(struct packing_data *pack,
 				unsigned char layer)
 {
 	if (!pack->layer)
-		ALLOC_ARRAY(pack->layer, pack->nr_objects);
+		ALLOC_ARRAY(pack->layer, pack->nr_alloc);
 	pack->layer[e - pack->objects] = layer;
 }
 
-- 
2.20.0.rc0.715.gf6b01ab3e1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] pack-objects: zero-initialize tree_depth/layer arrays
  2018-11-20  9:44 [PATCH 0/3] delta-island fixes Jeff King
  2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
@ 2018-11-20  9:48 ` Jeff King
  2018-11-20  9:50 ` [PATCH 3/3] pack-objects: fix off-by-one in delta-island tree-depth computation Jeff King
  2018-11-20 15:06 ` [PATCH 0/3] delta-island fixes Derrick Stolee
  3 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-11-20  9:48 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, Derrick Stolee

Commit 108f530385 (pack-objects: move tree_depth into 'struct
packing_data', 2018-08-16) started maintaining a tree_depth array that
matches the "objects" array. We extend the array when:

  1. The objects array is extended, in which case we use realloc to
     extend the tree_depth array.

  2. A caller asks to store a tree_depth for object N, and this is the
     first such request; we create the array from scratch and store the
     value for N.

In the latter case, though, we use regular xmalloc(), and the depth
values for any objects besides N is undefined. This happens to not
trigger a bug with the current code, but the reasons are quite subtle:

 - we never ask about the depth for any object with index i < N. This is
   because we store the depth immediately for all trees and blobs. So
   any such "i" must be a non-tree, and therefore we will never need to
   care about its depth (in fact, we really only care about the depth of
   trees).

 - there are no objects at this point with index i > N, because we
   always fill in the depth for a tree immediately after its object
   entry is created (we may still allocate uninitialized depth entries,
   but they'll be initialized by packlist_alloc() when it initializes
   the entry in the "objects" array).

So it works, but only by chance. To be defensive, let's zero the array,
which matches the "unset" values which would be handed out by
oe_tree_depth() already.

Signed-off-by: Jeff King <peff@peff.net>
---
 git-compat-util.h | 1 +
 pack-objects.h    | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index f16058182f..09b0102cae 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -861,6 +861,7 @@ extern FILE *fopen_or_warn(const char *path, const char *mode);
 #define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)
 
 #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), (alloc)))
+#define CALLOC_ARRAY(x, alloc) (x) = xcalloc((alloc), sizeof(*(x)));
 #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), st_mult(sizeof(*(x)), (alloc)))
 
 #define COPY_ARRAY(dst, src, n) copy_array((dst), (src), (n), sizeof(*(dst)) + \
diff --git a/pack-objects.h b/pack-objects.h
index f31ac1c81c..dc869f26c2 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -412,7 +412,7 @@ static inline void oe_set_tree_depth(struct packing_data *pack,
 				     unsigned int tree_depth)
 {
 	if (!pack->tree_depth)
-		ALLOC_ARRAY(pack->tree_depth, pack->nr_alloc);
+		CALLOC_ARRAY(pack->tree_depth, pack->nr_alloc);
 	pack->tree_depth[e - pack->objects] = tree_depth;
 }
 
@@ -429,7 +429,7 @@ static inline void oe_set_layer(struct packing_data *pack,
 				unsigned char layer)
 {
 	if (!pack->layer)
-		ALLOC_ARRAY(pack->layer, pack->nr_alloc);
+		CALLOC_ARRAY(pack->layer, pack->nr_alloc);
 	pack->layer[e - pack->objects] = layer;
 }
 
-- 
2.20.0.rc0.715.gf6b01ab3e1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] pack-objects: fix off-by-one in delta-island tree-depth computation
  2018-11-20  9:44 [PATCH 0/3] delta-island fixes Jeff King
  2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
  2018-11-20  9:48 ` [PATCH 2/3] pack-objects: zero-initialize tree_depth/layer arrays Jeff King
@ 2018-11-20  9:50 ` Jeff King
  2018-11-20 15:06 ` [PATCH 0/3] delta-island fixes Derrick Stolee
  3 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-11-20  9:50 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, Derrick Stolee

When delta-islands are in use, we need to record the deepest path at
which we find each tree and blob. Our loop to do so counts slashes, so
"foo" is depth 0, "foo/bar" is depth 1, and so on.

However, this neglects root trees, which are represented by the empty
string. Those also have depth 0, but are at a layer above "foo". Thus,
"foo" should be 1, "foo/bar" at 2, and so on. We use this depth to
topo-sort the trees in resolve_tree_islands(). As a result, we may fail
to visit a root tree before the sub-trees it contains, and therefore not
correctly pass down the island marks.

That in turn could lead to missing some delta opportunities (objects are
in the same island, but we didn't realize it) or creating unwanted
cross-island deltas (one object is in an island another isn't, but we
don't realize). In practice, it seems to have only a small effect.  Some
experiments on the real-world git/git fork network at GitHub showed an
improvement of only 0.14% in the resulting clone size.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e7ea206c08..411aefd687 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2786,9 +2786,11 @@ static void show_object(struct object *obj, const char *name, void *data)
 
 	if (use_delta_islands) {
 		const char *p;
-		unsigned depth = 0;
+		unsigned depth;
 		struct object_entry *ent;
 
+		/* the empty string is a root tree, which is depth 0 */
+		depth = *name ? 1 : 0;
 		for (p = strchr(name, '/'); p; p = strchr(p + 1, '/'))
 			depth++;
 
-- 
2.20.0.rc0.715.gf6b01ab3e1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] delta-island fixes
  2018-11-20  9:44 [PATCH 0/3] delta-island fixes Jeff King
                   ` (2 preceding siblings ...)
  2018-11-20  9:50 ` [PATCH 3/3] pack-objects: fix off-by-one in delta-island tree-depth computation Jeff King
@ 2018-11-20 15:06 ` Derrick Stolee
  2018-11-22 16:43   ` Jeff King
  3 siblings, 1 reply; 9+ messages in thread
From: Derrick Stolee @ 2018-11-20 15:06 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Christian Couder, Derrick Stolee

On 11/20/2018 4:44 AM, Jeff King wrote:
> In cases like this I think it's often a good idea to have a perf test.
> Those are expensive anyway, and we'd have the double benefit of
> exercising the code and showing off the performance improvement. But the
> delta-island code only makes sense on a very specialized repo: one where
> you have multiple related but diverging histories.  You could simulate
> that by picking two branches in a repository, but the effect is going to
> be miniscule.

The changes in this series look correct. Too bad it is difficult to test.

Perhaps we should add a performance test for the --delta-islands check 
that would trigger the failure (and maybe a clone afterwards). There are 
lots of freely available forks of git.git that present interesting fork 
structure. Here are three that would suffice to make this interesting:

     https://github.com/git/git
     https://github.com/git-for-windows/git
     https://github.com/microsoft/git

Of course, running it on a specific repo is up to the person running the 
perf suite.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] pack-objects: fix tree_depth and layer invariants
  2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
@ 2018-11-20 16:37   ` Duy Nguyen
  2018-11-22 16:50     ` Jeff King
  2018-11-21  4:52   ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Duy Nguyen @ 2018-11-20 16:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Christian Couder, Derrick Stolee

On Tue, Nov 20, 2018 at 11:04 AM Jeff King <peff@peff.net> wrote:
>
> Commit 108f530385 (pack-objects: move tree_depth into 'struct
> packing_data', 2018-08-16) dynamically manages a tree_depth array in
> packing_data that maintains one of these invariants:
>
>   1. tree_depth is NULL (i.e., the requested options don't require us to
>      track tree depths)
>
>   2. tree_depth is non-NULL and has as many entries as the "objects"
>      array
>
> We maintain (2) by:
>
>   a. When the objects array grows, grow tree_depth to the same size
>      (unless it's NULL, in which case we can leave it).
>
>   b. When a caller asks to set a depth via oe_set_tree_depth(), if
>      tree_depth is NULL we allocate it.
>
> But in (b), we use the number of stored objects, _not_ the allocated
> size of the objects array. So we can run into a situation like this:
>
>   1. packlist_alloc() needs to store the Nth object, so it grows the
>      objects array to M, where M > N.
>
>   2. oe_set_tree_depth() wants to store a depth, so it allocates an
>      array of length N. Now we've violated our invariant.
>
>   3. packlist_alloc() needs to store the N+1th object. But it _doesn't_
>      grow the objects array, since N <= M still holds. We try to assign
>      to tree_depth[N+1], which is out of bounds.

Do you think if this splitting data to packing_data is too fragile
that we should just scrape the whole thing and move all data back to
object_entry[]? We would use more memory of course but higher memory
usage is still better than more bugs (if these are likely to show up
again).
-- 
Duy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] pack-objects: fix tree_depth and layer invariants
  2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
  2018-11-20 16:37   ` Duy Nguyen
@ 2018-11-21  4:52   ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2018-11-21  4:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Christian Couder, Derrick Stolee

Jeff King <peff@peff.net> writes:

> But in (b), we use the number of stored objects, _not_ the allocated
> size of the objects array. So we can run into a situation like this:
>
>   1. packlist_alloc() needs to store the Nth object, so it grows the
>      objects array to M, where M > N.
>
>   2. oe_set_tree_depth() wants to store a depth, so it allocates an
>      array of length N. Now we've violated our invariant.
>
>   3. packlist_alloc() needs to store the N+1th object. But it _doesn't_
>      grow the objects array, since N <= M still holds. We try to assign
>      to tree_depth[N+1], which is out of bounds.

Ouch.  I see counting and allocationg is hard (I think I spotted a
bug in another area that comes from the same "count while filtering
and then allocate" pattern during this cycle).  Thanks for spotting.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] delta-island fixes
  2018-11-20 15:06 ` [PATCH 0/3] delta-island fixes Derrick Stolee
@ 2018-11-22 16:43   ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-11-22 16:43 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Christian Couder, Derrick Stolee

On Tue, Nov 20, 2018 at 10:06:57AM -0500, Derrick Stolee wrote:

> On 11/20/2018 4:44 AM, Jeff King wrote:
> > In cases like this I think it's often a good idea to have a perf test.
> > Those are expensive anyway, and we'd have the double benefit of
> > exercising the code and showing off the performance improvement. But the
> > delta-island code only makes sense on a very specialized repo: one where
> > you have multiple related but diverging histories.  You could simulate
> > that by picking two branches in a repository, but the effect is going to
> > be miniscule.
> 
> The changes in this series look correct. Too bad it is difficult to test.
> 
> Perhaps we should add a performance test for the --delta-islands check that
> would trigger the failure (and maybe a clone afterwards). There are lots of
> freely available forks of git.git that present interesting fork structure.
> Here are three that would suffice to make this interesting:
> 
>     https://github.com/git/git
>     https://github.com/git-for-windows/git
>     https://github.com/microsoft/git
> 
> Of course, running it on a specific repo is up to the person running the
> perf suite.

I hadn't really considered the possibility of reconstructing a fork
network repository from public repos. It probably would be possible to
include a script which does so, although:

  - I suspect it's going to be pretty expensive. We can use --reference
    to reduce the size of subsequent clones, but just the repacks you
    have to do to assemble the final shared repo can get pretty
    expensive.

  - That's 3 forks. Our real-world case has over 4000. I'm not sure of
    the size of the effects for smaller cases.

So in short, I think it's an interesting avenue to explore, but it might
turn out to be a dead-end. I'm not going to prioritize it right now, but
if somebody wants to play with it, I'd be happy to look over the
results.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] pack-objects: fix tree_depth and layer invariants
  2018-11-20 16:37   ` Duy Nguyen
@ 2018-11-22 16:50     ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-11-22 16:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Christian Couder, Derrick Stolee

On Tue, Nov 20, 2018 at 05:37:18PM +0100, Duy Nguyen wrote:

> > But in (b), we use the number of stored objects, _not_ the allocated
> > size of the objects array. So we can run into a situation like this:
> >
> >   1. packlist_alloc() needs to store the Nth object, so it grows the
> >      objects array to M, where M > N.
> >
> >   2. oe_set_tree_depth() wants to store a depth, so it allocates an
> >      array of length N. Now we've violated our invariant.
> >
> >   3. packlist_alloc() needs to store the N+1th object. But it _doesn't_
> >      grow the objects array, since N <= M still holds. We try to assign
> >      to tree_depth[N+1], which is out of bounds.
> 
> Do you think if this splitting data to packing_data is too fragile
> that we should just scrape the whole thing and move all data back to
> object_entry[]? We would use more memory of course but higher memory
> usage is still better than more bugs (if these are likely to show up
> again).

Certainly that thought crossed my mind while working on these patches. :)

Especially given the difficulties it introduced into the recent
bitmap-reuse topic, and the size fixes we had to deal with in v2.19.

Overall, though, I dunno. This fix, while subtle, turned out not to be
too complicated. And the memory savings are real. I consider 100M
objects to be on the large size of feasible for stock Git these days,
and I think we are talking about on the order of 4GB memory savings
there. You need a big machine to handle a repository of that size, but
4GB is still appreciable.

So I guess at this point, with all (known) bugs fixed, we should stick
with it for now. If it becomes a problem for development of a future
feature, then we can re-evaluate then.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-11-22 16:50 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-20  9:44 [PATCH 0/3] delta-island fixes Jeff King
2018-11-20  9:46 ` [PATCH 1/3] pack-objects: fix tree_depth and layer invariants Jeff King
2018-11-20 16:37   ` Duy Nguyen
2018-11-22 16:50     ` Jeff King
2018-11-21  4:52   ` Junio C Hamano
2018-11-20  9:48 ` [PATCH 2/3] pack-objects: zero-initialize tree_depth/layer arrays Jeff King
2018-11-20  9:50 ` [PATCH 3/3] pack-objects: fix off-by-one in delta-island tree-depth computation Jeff King
2018-11-20 15:06 ` [PATCH 0/3] delta-island fixes Derrick Stolee
2018-11-22 16:43   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).