git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
@ 2016-07-07 19:09 Kirill Smelkov
  2016-07-07 20:52 ` Jeff King
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-07 19:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jérome Perrin, Isabelle Vallet, Kazuhiko Shiozaki,
	Julien Muchembled, git, Kirill Smelkov, Vicent Marti, Jeff King

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That, however, was done only
for the case when the resultant pack is sent to stdout, not written into
a file.

We can teach pack-objects to use the bitmap index for the initial object
counting phase when generating a resultant pack file too, under two
conditions:

- only if we know bitmap index generation is not enabled for the
  resultant pack:

  The current code has a singleton bitmap_git, so it cannot work
  simultaneously with two bitmap indices.

- only if we keep pack reuse enabled solely for the "send-to-stdout" case:

  On pack reuse, raw entries are written directly to the destination
  pack by write_reused_pack(), bypassing the bookkeeping needed for pack
  index generation, which the regular codepath does in write_one() and
  friends.

  (at least that's my understanding after briefly looking at the code)

We also need to teach add_object_entry_from_bitmap() to respect --local
by not adding nonlocal loose objects to the resultant pack. This is the
bitmap-codepath counterpart of daae0625 (pack-objects: extend --local to
mean ignore non-local loose objects too), and is needed so as not to
break 'loose objects in alternate ODB are not repacked' in
t7700-repack.sh.

Otherwise all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

Cc: Vicent Marti <tanoku@gmail.com>
Cc: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  | 7 +++++--
 t/t5310-pack-bitmaps.sh | 9 +++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..be0ebe8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1052,6 +1052,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 {
 	uint32_t index_pos;
 
+	if (local && has_loose_object_nonlocal(sha1))
+		return 0;
+
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
@@ -2488,7 +2491,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2773,7 +2776,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..533fc31 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,15 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# pack-objects uses bitmap index by default, when it is available
+	packsha1=$(git pack-objects --all mypack </dev/null) &&
+	git verify-pack mypack-$packsha1.pack
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.gb11dac7.dirty


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-07 19:09 [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
@ 2016-07-07 20:52 ` Jeff King
  2016-07-08 10:38   ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-07-07 20:52 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:

> Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> if a repository has bitmap index, pack-objects can nicely speedup
> "Counting objects" graph traversal phase. That however was done only for
> case when resultant pack is sent to stdout, not written into a file.
> 
> We can teach pack-objects to use bitmap index for initial object
> counting phase when generating resultant pack file too:

I'm not sure this is a good idea in general. When bitmaps are in use, we
cannot fill out the details in the object-packing list as thoroughly. In
particular:

  - we will not compute the same write order (which is based on
    traversal order), leading to packs that have less efficient cache
    characteristics

  - we don't learn about the filename of trees and blobs, which is going
    to make the delta step much less efficient. This might be mitigated
    by turning on the bitmap name-hash cache; I don't recall how much
    detail pack-objects needs on the name (i.e., the full name versus
    just the hash).

There may be other subtle things, too. The general idea of tying the
bitmap use to pack_to_stdout is that you _do_ want to use it for
serving fetches and pushes, but for a full on-disk repack via gc, it's
more important to generate a good pack.

Your use case:

> git-backup extracts many packs on repositories restoration. That was my
> initial motivation for the patch.

Seems to be somewhere in between. I'm not sure I understand how you're
invoking pack-objects here, but I wonder if you should be using
"pack-objects --stdout" yourself.

But even if it is the right thing for your use case to be using bitmaps
to generate an on-disk bitmap, I think we should be making sure it
_doesn't_ trigger when doing a normal repack.

-Peff


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-07 20:52 ` Jeff King
@ 2016-07-08 10:38   ` Kirill Smelkov
  2016-07-12 19:08     ` Kirill Smelkov
  2016-07-13  8:26     ` Jeff King
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-08 10:38 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

Peff, first of all, thanks for the feedback.

On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote:
> On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:
> 
> > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > if a repository has bitmap index, pack-objects can nicely speedup
> > "Counting objects" graph traversal phase. That however was done only for
> > case when resultant pack is sent to stdout, not written into a file.
> > 
> > We can teach pack-objects to use bitmap index for initial object
> > counting phase when generating resultant pack file too:
> 
> I'm not sure this is a good idea in general. When bitmaps are in use, we
> cannot fill out the details in the object-packing list as thoroughly. In
> particular:
> 
>   - we will not compute the same write order (which is based on
>     traversal order), leading to packs that have less efficient cache
>     characteristics

I agree the order may not be exactly the same. Still, if the original
pack is packed well (with good recency order), then while using the
bitmap we will tend to traverse it in close to the original order.

Maybe I'm not completely right on this, but to me it looks to be the
case: if objects in the original pack are laid out linearly in recency
order, and we use the bitmap index to get the set of all objects
reachable from a root, and then just _linearly_ gather all those objects
from the original pack by the 1s in the bitmap and put them in the same
order into the destination pack, the recency order won't be broken.

Or am I maybe misunderstanding something?
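
To make the argument above concrete, here is a minimal C sketch. It is
not git's bitmap code: gather_by_bitmap() and keeps_source_order() are
hypothetical helpers, and the assumption is that bit i of the bitmap
stands for the object at position i of the original pack. Scanning the
set bits from lowest to highest can only yield increasing positions, so
the selected objects keep the relative order they had in the source pack:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of git: copy into `out` the pack
 * positions whose bit is set in `bitmap`, scanning bits from lowest to
 * highest. Bit i stands for the object at position i of the source pack.
 * Returns the number of positions written.
 */
static size_t gather_by_bitmap(const uint64_t *bitmap, size_t nr_objects,
			       uint32_t *out)
{
	size_t i, n = 0;

	for (i = 0; i < nr_objects; i++) {
		if (bitmap[i / 64] & ((uint64_t)1 << (i % 64)))
			out[n++] = (uint32_t)i;
	}
	return n;
}

/* Returns 1 if `pos` is strictly increasing, i.e. source order is kept. */
static int keeps_source_order(const uint32_t *pos, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (pos[i - 1] >= pos[i])
			return 0;
	return 1;
}
```

Whether the real bitmap walk hands objects to pack-objects strictly in
this order is exactly the part I am not sure about; the sketch only
shows why a linear scan by itself cannot reorder anything.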

Please also see below:

>   - we don't learn about the filename of trees and blobs, which is going
>     to make the delta step much less efficient. This might be mitigated
>     by turning on the bitmap name-hash cache; I don't recall how much
>     detail pack-objects needs on the name (i.e., the full name versus
>     just the hash).

If I understand it right, it uses only the uint32_t name hash while
searching. From pack-objects.{h,c}:

---- 8< ----
struct object_entry {
	...
	uint32_t hash;                  /* name hint hash */


/*
 * We search for deltas in a list sorted by type, by filename hash, and then
 * by size, so that we see progressively smaller and smaller files.
 * That's because we prefer deltas to be from the bigger file
 * to the smaller -- deletes are potentially cheaper, but perhaps
 * more importantly, the bigger file is likely the more recent
 * one.  The deepest deltas are therefore the oldest objects which are
 * less susceptible to be accessed often.
 */
static int type_size_sort(const void *_a, const void *_b)
{
        const struct object_entry *a = *(struct object_entry **)_a;
        const struct object_entry *b = *(struct object_entry **)_b;

        if (a->type > b->type)
                return -1;
        if (a->type < b->type) 
                return 1;
        if (a->hash > b->hash)
                return -1;
        if (a->hash < b->hash)
                return 1;
	...
---- 8< ----

Documentation/technical/pack-heuristics.txt also confirms this:

---- 8< ----
    ...
    <gitster> The quote from the above linus should be rewritten a
        bit (wait for it):
        - first sort by type.  Different objects never delta with
          each other.
        - then sort by filename/dirname.  hash of the basename
          occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
          DIR_BITS are for the hash of leading path elements.

    ...

    If I might add, the trick is to make files that _might_ be similar be
    located close to each other in the hash buckets based on their file
    names.  It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
    "Makefile" all landed in the same bucket due to their common basename,
    "Makefile". However, now they land in "close" buckets.
    
    The algorithm allows not just for the _same_ bucket, but for _close_
    buckets to be considered delta candidates.  The rationale is
    essentially that files, like Makefiles, often have very similar
    content no matter what directory they live in.
---- 8< ----


So yes, exactly as you say, with pack.writeBitmapHashCache=true (ae4f07fb) the
delta-search heuristic is almost as efficient as with raw filenames.

I can confirm this also via e.g. (with my patch applied):
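
As an illustration of why a 32-bit hash alone can be enough, here is a
small C sketch of a name-hint hash. It is a simplified stand-in written
for this mail, not a copy of the function git uses, and the name
name_hint_hash() is hypothetical. Every character is added into the top
byte while what was seen before is shifted down by two bits, so the tail
of the path (the basename) dominates and paths sharing a basename end up
numerically close:

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Hypothetical name-hint hash: later characters land in higher bits, so
 * "Makefile" and "foo/Makefile" sort next to each other when entries
 * are ordered by hash. Whitespace is skipped.
 */
static uint32_t name_hint_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

With this scheme the leading directories of "foo/Makefile" only disturb
the low 16 bits relative to plain "Makefile", while an unrelated
basename such as "main.c" lands far away.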

---- 8< ----
$ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs erp5pack-plain
Counting objects: 627171, done.
Compressing objects: 100% (176949/176949), done.
50570987560d481742af4a8083028c2322a0534a
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439404), reused 594820 (delta 410210)

real    0m37.272s
user    0m33.648s
sys     0m1.580s

$ time echo 0186ac99 | git pack-objects --revs erp5pack-bitmap
Counting objects: 627171, done.
Compressing objects: 100% (176914/176914), done.
7c15a9b1eca1326e679297b217c5a48954625ca2
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439484), reused 594855 (delta 410245)

real    0m27.020s
user    0m23.364s
sys     0m0.992s

$ ll erp5pack-{plain,bitmap}*
	  17561860  erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.idx
	 238760161  erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack
	  17561860  erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.idx
	 238634201  erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack
---- 8< ----

( By the way, about the pack generated with the bitmap retaining close
  recency order:

  ---- 8< ----
  $ git verify-pack -v erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack >1
  $ git verify-pack -v erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack >2
  $ grep commit 1 |awk '{print $1}' >1.commit
  $ grep commit 2 |awk '{print $1}' >2.commit
  $ wc -l 1.commit
  46136 1.commit
  $ wc -l 2.commit
  46136 2.commit
  $ diff -u0 1.commit 2.commit |wc -l
  55
  ---- 8< ----
  
  so 55 lines of diff output for 46136 commits shows the order is almost
  exactly the same. )


> There may be other subtle things, too. The general idea of tying the
> bitmap use to pack_to_stdout is that you _do_ want to use it for
> serving fetches and pushes, but for a full on-disk repack via gc, it's
> more important to generate a good pack.

It is better to send good packs to clients too, right? And with
pack.writeBitmapHashCache=true and the recency order retained (please see
above, but again maybe I'm not completely right), it seems to me we would
still be generating a good pack while using the bitmap reachability index
for object graph traversal.

> Your use case:
> 
> > git-backup extracts many packs on repositories restoration. That was my
> > initial motivation for the patch.
> 
> Seems to be somewhere in between. I'm not sure I understand how you're
> invoking pack-objects here,

It is just

    pack-objects --revs --reuse-object --reuse-delta --delta-base-offset extractedrepo/objects/pack/pack  < SHA1-HEADS

    https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L829

> but I wonder if you should be using "pack-objects --stdout" yourself.

I already tried --stdout. The problem is that on repository extraction we
need to both extract the pack and index it. While `pack-objects file`
does both, for the --stdout case we need to additionally index the extracted
pack with `git index-pack`, and standalone `git index-pack` is very slow
- in my experience much slower than generating the pack itself:

---- 8< ----
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
Counting objects: 627171, done.
Compressing objects: 100% (176914/176914), done.
Total 627171 (delta 439484), reused 594855 (delta 410245)

real    0m22.309s
user    0m21.148s
sys     0m0.932s

$ ll erp5pack-stdout*
        238760161   erp5pack-stdout.pack

$ time git index-pack erp5pack-stdout.pack
7c15a9b1eca1326e679297b217c5a48954625ca2

real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
user    0m49.300s
sys     0m1.360s

$ ll erp5pack-stdout*
         17561860   erp5pack-stdout.idx
        238760161   erp5pack-stdout.pack
---- 8< ----

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack`  is  73s,

while

    `pack-objects file.pack` which does both pack and index      is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`               is  37s.


I've briefly tried to see why index-pack is so slow, and offhand I can
see that it needs to load all objects, decompress them, etc. (maybe I'm
not entirely right here - I looked only briefly), whereas pack-objects,
while generating the pack, has all the needed information directly at
hand and thus can emit the index much more easily.

For the server-client scenario, the index-pack load is put onto the
clients, thus offloading the server, but for my use case, where the
extracted repository is on the same machine, the load does not go away.

That's why for me it makes more sense to emit both the pack and its
index in one go.

Still, it would be interesting to eventually see why index-pack is so
anomalously slow.

> But even if it is the right thing for your use case to be using bitmaps
> to generate an on-disk bitmap, I think we should be making sure it
> _doesn't_ trigger when doing a normal repack.

So it seems the way forward here is to teach pack-objects not to silently
drop an explicit --use-bitmap-index for cases when it can handle it?
(Currently, even if this option is given, for !stdout cases pack-objects
simply sets use_bitmap_index to 0.)

And to make sure the default for use_bitmap_index is 0 for !stdout cases?

Or are we fine with my arguments about recency order staying the same
when using the bitmap reachability index for object graph traversal, so
that the patch is fine to go in as it is?

Thanks again,
Kirill


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-08 10:38   ` Kirill Smelkov
@ 2016-07-12 19:08     ` Kirill Smelkov
  2016-07-13  8:30       ` Jeff King
  2016-07-13  8:26     ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-12 19:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> > But even if it is the right thing for your use case to be using bitmaps
> > to generate an on-disk bitmap, I think we should be making sure it
> > _doesn't_ trigger when doing a normal repack.
> 
> So it seems the way forward here is to teach pack-objects not to silently
> drop an explicit --use-bitmap-index for cases when it can handle it?
> (Currently, even if this option is given, for !stdout cases pack-objects
> simply sets use_bitmap_index to 0.)
> 
> And to make sure the default for use_bitmap_index is 0 for !stdout cases?
> 
> Or are we fine with my arguments about recency order staying the same
> when using the bitmap reachability index for object graph traversal, so
> that the patch is fine to go in as it is?

Since there has been no reply, I assume the safe way to go is to let the
default for the pack-to-file case be "not using bitmap index". Please find
the updated patch and interdiff below. I would still be grateful for
feedback on my above use-bitmap-for-pack-to-file arguments.
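
To spell out the intended behaviour, here is a minimal C sketch of how
the effective setting could be resolved. It is an illustration under
assumptions, not the patch itself: struct bitmap_prefs and
resolve_use_bitmap_index() are hypothetical, and the tri-state `cmdline`
argument (-1 when neither --use-bitmap-index nor --no-use-bitmap-index
was given) is my own device for keeping an explicit option from being
silently overridden by the per-destination default:

```c
/* Hypothetical container for the two config defaults. */
struct bitmap_prefs {
	int use_bitmap_stdout;	/* default when packing to stdout */
	int use_bitmap_file;	/* default when packing to a file */
};

/*
 * cmdline: -1 = option not given, 0 = --no-use-bitmap-index,
 *           1 = --use-bitmap-index.
 * Returns 1 if the bitmap index should be used for object counting.
 */
static int resolve_use_bitmap_index(const struct bitmap_prefs *prefs,
				    int cmdline, int pack_to_stdout,
				    int write_bitmap_index,
				    int use_internal_rev_list,
				    int repo_is_shallow)
{
	int use;

	/* An explicit option wins over the per-destination default. */
	if (cmdline >= 0)
		use = cmdline;
	else
		use = pack_to_stdout ? prefs->use_bitmap_stdout
				     : prefs->use_bitmap_file;

	/* Cases where the bitmap cannot be used at all. */
	if (!use_internal_rev_list || repo_is_shallow)
		return 0;
	if (!pack_to_stdout && write_bitmap_index)
		return 0;
	return use;
}
```

With defaults of 1 for stdout and 0 for file, a plain repack to a file
stays on the regular traversal, while an explicit --use-bitmap-index
for a pack-to-file run is honoured unless a bitmap is also being written.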

Thanks,
Kirill

(interdiff)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index e455fae..1888f42 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2241,12 +2241,20 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
-pack.useBitmaps::
+pack.useBitmaps (deprecated)::
+	This is a deprecated synonym for `pack.useBitmaps.stdout`.
+
+pack.useBitmaps.stdout::
 	When true, git will use pack bitmaps (if available) when packing
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmaps.file::
+	When true, git will use pack bitmaps (if available) when packing
+	to file (e.g., on repack). Defaults to false. You should not
+	generally need to turn this on unless you know what you are doing.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index be0ebe8..7aaa1af 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_stdout = 1, use_bitmap_file = 0;
+static int use_bitmap_index;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2227,8 +2228,12 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
-	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
+		use_bitmap_stdout = git_config_bool(k, v);
+		return 0;
+	}
+	if (!strcmp(k, "pack.usebitmaps.file")) {
+		use_bitmap_file = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2705,6 +2710,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 533fc31..9fab2bb 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -122,9 +122,14 @@ test_expect_success 'pack-objects to file can use bitmap' '
 	# make sure we still have 1 bitmap index from previous tests
 	ls .git/objects/pack/ | grep bitmap >output &&
 	test_line_count = 1 output &&
-	# pack-objects uses bitmap index by default, when it is available
-	packsha1=$(git pack-objects --all mypack </dev/null) &&
-	git verify-pack mypack-$packsha1.pack
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
 '
 
 test_expect_success 'full repack, reusing previous bitmaps' '


---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 7 Jul 2016 20:12:00 +0300
Subject: [PATCH v2] pack-objects: Teach it to use reachability bitmap index when
 generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

We can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

  (at least that's my understanding after briefly looking at the code)

We also need to teach add_object_entry_from_bitmap() to respect
--local by not adding nonlocal loose objects to the resultant pack (this
is the bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
to mean ignore non-local loose objects too)), so as not to break 'loose
objects in alternate ODB are not repacked' in t7700-repack.sh.

Otherwise all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff King suggested that it might not generally be a good idea to
use the bitmap reachability index when repacking a repository. For this
reason, when packing to a file the default is not to use bitmaps, while
for the packing-to-stdout case the default remains "bitmap is used".

The defaults can be configured with

    pack.useBitmaps.stdout      (renamed from pack.useBitmaps), and
    pack.useBitmaps.file

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107

Cc: Vicent Marti <tanoku@gmail.com>
Cc: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt | 10 +++++++++-
 builtin/pack-objects.c   | 19 ++++++++++++++-----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index e455fae..1888f42 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2241,12 +2241,20 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
-pack.useBitmaps::
+pack.useBitmaps (deprecated)::
+	This is a deprecated synonym for `pack.useBitmaps.stdout`.
+
+pack.useBitmaps.stdout::
 	When true, git will use pack bitmaps (if available) when packing
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmaps.file::
+	When true, git will use pack bitmaps (if available) when packing
+	to file (e.g., on repack). Defaults to false. You should not
+	generally need to turn this on unless you know what you are doing.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..7aaa1af 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_stdout = 1, use_bitmap_file = 0;
+static int use_bitmap_index;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 {
 	uint32_t index_pos;
 
+	if (local && has_loose_object_nonlocal(sha1))
+		return 0;
+
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
@@ -2224,8 +2228,12 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
-	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
+		use_bitmap_stdout = git_config_bool(k, v);
+		return 0;
+	}
+	if (!strcmp(k, "pack.usebitmaps.file")) {
+		use_bitmap_file = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2488,7 +2496,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2702,6 +2710,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
 
@@ -2773,7 +2782,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..9fab2bb 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,20 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-08 10:38   ` Kirill Smelkov
  2016-07-12 19:08     ` Kirill Smelkov
@ 2016-07-13  8:26     ` Jeff King
  2016-07-13 10:52       ` Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-07-13  8:26 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:

> >   - we will not compute the same write order (which is based on
> >     traversal order), leading to packs that have less efficient cache
> >     characteristics
> 
> I agree the order can be not exactly the same. Still if original pack is
> packed well (with good recency order), while using bitmap we will tend
> to traverse it in close to original order.
> 
> Maybe I'm not completely right on this, but to me it looks to be the
> case because if objects in original pack are put there linearly sorted
> by recency order, and we use bitmap index to set of all reachable
> objects from a root, and then just _linearly_ gather all those objects
> from original pack by 1s in bitmap and put them in the same order into
> destination pack, the recency order won't be broken.
> 
> Or am I maybe misunderstanding something?

Yeah, I think you can go some of the way by reusing the order from the
old pack. But keep in mind that the bitmap result may also contain
objects that are not yet packed. Those will just come in a big lump at
the end of the bitmap (these are the "extended entries" in the bitmap
code).

So I think if you were to repeatedly "git repack -adb" over time, you
would get worse and worse ordering as objects are added to the
repository.
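
To make the ordering concern concrete, here is a toy model in C. It is only
a sketch and not git's bitmap code: `struct toy_bitmap`,
`emit_in_bitmap_order()` and the one-byte-per-bit layout are hypothetical
names and simplifications chosen for illustration. It assumes bits below
`n_packed` refer to objects in pack order and the remaining bits refer to
the not-yet-packed "extended" objects.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not git's real data structures): bit i with
 * i < n_packed refers to the i-th object of the bitmapped pack, in pack
 * order; bits from n_packed upwards refer to "extended" objects that
 * were not yet packed when the bitmap result was computed.
 */
struct toy_bitmap {
	const uint8_t *bits;	/* one 0/1 value per byte, for simplicity */
	size_t n_packed;
	size_t n_extended;
};

/*
 * Store the positions of all set bits into out[] in bitmap order and
 * return how many were stored. *first_extended receives the index in
 * out[] at which the extended entries begin. out[] must have room for
 * n_packed + n_extended entries.
 */
static size_t emit_in_bitmap_order(const struct toy_bitmap *bm,
				   size_t *out, size_t *first_extended)
{
	size_t total = bm->n_packed + bm->n_extended;
	size_t i, n = 0;

	*first_extended = 0;
	for (i = 0; i < total; i++) {
		if (i == bm->n_packed)
			*first_extended = n;
		if (bm->bits[i])
			out[n++] = i;
	}
	if (bm->n_extended == 0)
		*first_extended = n;
	return n;
}
```

In this model the packed objects keep their relative pack order, while
everything from `*first_extended` onwards is emitted as one lump at the
end regardless of how recent those objects are.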

As an aside, two other things that pack order matters for: it makes the
bitmaps themselves compress better (because it increases locality of
reachability, so you get nice runs of "1" or "0" bits). It also makes
the pack-reuse code more efficient (since in an ideal case, you can just
dump a big block of data from the front of the pack). Note that the
pack-reuse code that's in upstream git isn't that great; I have a better
system on my big pile of patches to send upstream (that never seems to
get smaller; <sigh>).
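
As a rough illustration of the "nice runs" point, the sketch below counts
maximal runs of identical bits. It is a toy under stated assumptions:
`count_runs()` is a hypothetical helper, and the one-byte-per-bit array is
a simplification rather than the on-disk bitmap encoding. A run-length
style encoding has less to store when there are fewer runs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count maximal runs of identical bits. bits[] holds one 0/1 value per
 * byte (a simplification chosen for readability).
 */
static size_t count_runs(const uint8_t *bits, size_t n)
{
	size_t i, runs = 0;

	for (i = 0; i < n; i++)
		if (i == 0 || bits[i] != bits[i - 1])
			runs++;
	return runs;
}
```

With good locality a reachability bitmap might look like 111000 (2 runs);
with poor locality the same number of set bits could be spread out as
101010 (6 runs).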

> >   - we don't learn about the filename of trees and blobs, which is going
> >     to make the delta step much less efficient. This might be mitigated
> >     by turning on the bitmap name-hash cache; I don't recall how much
> >     detail pack-objects needs on the name (i.e., the full name versus
> >     just the hash).
> 
> If I understand it right, it uses only uint32_t name hash while searching. From
> pack-objects.{h,c} :

Yeah, I think you are right. Not having the real names is a problem for
doing rev-list output, but I think pack-objects doesn't care (though do
note that the name-hash cache is not enabled by default).
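
For readers who have not looked at it, a 32-bit name hash of this kind can
be sketched as follows. This is modelled on my understanding of the hash
pack-objects uses and is not copied from the source: the name
`toy_name_hash()` is hypothetical and the exact formula in git may differ.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hash a path name into 32 bits, skipping whitespace. Each new character
 * lands in the top byte and earlier characters are shifted down, so the
 * tail of the name carries the most weight.
 */
static uint32_t toy_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Sorting delta candidates by such a value tends to place files with
similar name endings next to each other, which is all the delta search
needs; the full path is not required.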

> > There may be other subtle things, too. The general idea of tying the
> > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > more important to generate a good pack.
> 
> It is better we send good packs to clients too, right? And with
> pack.writeBitmapHashCache=true and retaining recency order (please see
> above, but again maybe I'm not completely right) to me we should be still
> generating a good pack while using bitmap reachability index for object
> graph traversal.

We do want to send the client a good pack, but it's always a tradeoff.
We could spend much more time searching for the perfect delta, but at
some point we have to decide on how much CPU to spend serving them.
Likewise, even if the bitmapped packs we send are in slightly worse
order, saving a minute of CPU time off of every clone of the kernel is a
big deal.

We also take robustness shortcuts when sending to clients. For example,
when doing an on-disk repack we re-crc32 all of the delta data we are
reusing, even if we don't actually inflate it (because we would want to
stop immediately if we see even a single bit flipped on disk). But we
don't check them when sending to a client, because we know they are
going to actually `index-pack` it and get a stronger consistency check
anyway, and don't want to waste server CPU.

The bitmaps are sort of the same. If there is a bug or corruption in the
bitmap, the worst case is that we send a broken pack to the client, who
will complain that we did not give them all of the objects. It's a
momentary problem that can be fixed. If you use them for an on-disk
repack, then the next step is usually to delete all of the old packs. So
a corruption there carries forward, and is irreversible.

As I understand your use case, it is OK to do the less careful things.
It's just that pack-objects until now has been split into two modes:
packing to a file is careful, and packing to stdout is less so. And you
want to pack to a file in the non-careful mode.

> > but I wonder if you should be using "pack-objects --stdout" yourself.
> 
> I already tried --stdout. The problem is on repository extraction we
> need to both extract the pack and index it. While `pack-object file`
> does both, for --stdout case we need to additionally index extracted
> pack with `git index-pack`, and standalone `git index-pack` is very slow
> - in my experience much slower than generating the pack itself:

Ah, right, that makes sense. The packfile does not carry the sha1 of the
objects. A receiving index-pack has to compute them itself, including
inflating and applying all of the deltas! By contrast, a pack to stdout
can be quite quick, because in most cases it can avoid even inflating
most of the data; where possible it just sends the zlib data straight
from disk to the client.

So I do agree "--stdout" is not ideal for you (or at the very least, you
really want pack-objects to generate the index from its internal table
rather than having to reconstruct it just from the pack stream).

> > But even if it is the right thing for your use case to be using bitmaps
> > to generate an on-disk bitmap, I think we should be making sure it
> > _doesn't_ trigger when doing a normal repack.
> 
> So seems the way forward here is to teach pack-objects not to silently
> drop explicit --use-bitmap-index for cases when it can handle it?
> (currently even if this option was given, for !stdout cases pack-objects
> simply drop use_bitmap_index to 0).
> 
> And to make sure default for use_bitmap_index is 0 for !stdout cases?

I think it would be reasonable to accept "--use-bitmap-index" on the
command line as an override for "yes, really, this is what I want". So
the logic would be something like:

  static int use_bitmap_index_default = 1;
  static int use_bitmap_index = -1;

  ... parse config; if we see pack.usebitmaps, set
      use_bitmap_index_default ...

  ... parse command line, setting use_bitmap_index ...

  /* "soft" reasons not to use bitmaps */
  if (!pack_to_stdout)
	use_bitmap_index_default = 0;

  /* now install our default if the user didn't otherwise specify */
  if (use_bitmap_index < 0)
	use_bitmap_index = use_bitmap_index_default;

  /* "hard" reasons not to use bitmaps; these just won't work at all */
  if (!use_internal_rev_list || is_repository_shallow())
	use_bitmap_index = 0;
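
The same decision order can be written as a small self-contained C
function for experimenting with the combinations. This is a sketch only:
`struct bitmap_inputs` and `resolve_use_bitmap_index()` are hypothetical
names, and -1 is assumed to mean "not specified" for both the config and
the command-line value.

```c
#include <stdbool.h>

/* Hypothetical container for the inputs of the decision. */
struct bitmap_inputs {
	int config_value;	/* -1 unset, else 0/1 from pack.useBitmaps */
	int cmdline_value;	/* -1 not given, 0 --no-use-bitmap-index,
				 * 1 --use-bitmap-index */
	bool pack_to_stdout;
	bool use_internal_rev_list;
	bool repository_shallow;
};

static int resolve_use_bitmap_index(const struct bitmap_inputs *in)
{
	/* the built-in default is "use bitmaps" unless config says otherwise */
	int def = in->config_value < 0 ? 1 : in->config_value;
	int use = in->cmdline_value;

	/* "soft" reason: packing to a file defaults to no bitmaps */
	if (!in->pack_to_stdout)
		def = 0;

	/* install the default if the command line did not decide */
	if (use < 0)
		use = def;

	/* "hard" reasons: bitmaps cannot work at all in these cases */
	if (!in->use_internal_rev_list || in->repository_shallow)
		use = 0;

	return use;
}
```

Note that in this ordering an explicit command-line option overrides the
soft reason but never the hard ones, and a config value of true does not
enable bitmaps for the pack-to-file case on its own.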

-Peff


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-12 19:08     ` Kirill Smelkov
@ 2016-07-13  8:30       ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2016-07-13  8:30 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Tue, Jul 12, 2016 at 10:08:08PM +0300, Kirill Smelkov wrote:

> > Or are we fine with my arguments about recency order staying the same
> > when using bitmap reachability index for object graph traversal, and this
> > way the patch is fine to go in as it is?
> 
> Since there is no reply I assume the safe way to go is to let default
> for pack-to-file case to be "not using bitmap index". Please find updated
> patch and interdiff below. I would still be grateful for feedback on
> my above use-bitmap-for-pack-to-file arguments.

Yeah, I think that is a reasonable approach. I see here you've added new
config, though, and I don't think we want that.

For your purposes, where you're driving pack-objects individually, I
think a command-line option makes more sense.

If we did want to have a flag for "use bitmaps when repacking via
repack", I think it should be "repack.useBitmaps", and git-repack should
pass the command-line option to pack-objects. pack-objects is plumbing
and should not really be reading config at all. You'll note that
pack.writeBitmaps was a mistake and got deprecated in favor of
repack.writeBitmaps. I think pack.useBitmaps is a mistake, too, but
nobody has really noticed or cared because there's no good reason to set
it (the more interesting question is: are there bitmaps available? and
if so, we try to use them).

-Peff


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-13  8:26     ` Jeff King
@ 2016-07-13 10:52       ` Kirill Smelkov
  2016-07-17 17:06         ` Kirill Smelkov
  2016-07-25 18:40         ` Jeff King
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-13 10:52 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote:
> On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> 
> > >   - we will not compute the same write order (which is based on
> > >     traversal order), leading to packs that have less efficient cache
> > >     characteristics
> > 
> > I agree the order can be not exactly the same. Still if original pack is
> > packed well (with good recency order), while using bitmap we will tend
> > to traverse it in close to original order.
> > 
> > Maybe I'm not completely right on this, but to me it looks to be the
> > case because if objects in original pack are put there linearly sorted
> > by recency order, and we use bitmap index to set of all reachable
> > objects from a root, and then just _linearly_ gather all those objects
> > from original pack by 1s in bitmap and put them in the same order into
> > destination pack, the recency order won't be broken.
> > 
> > Or am I maybe misunderstanding something?
> 
> Yeah, I think you can go some of the way by reusing the order from the
> old pack. But keep in mind that the bitmap result may also contain
> objects that are not yet packed. Those will just come in a big lump at
> the end of the bitmap (these are the "extended entries" in the bitmap
> code).
> 
> So I think if you were to repeatedly "git repack -adb" over time, you
> would get worse and worse ordering as objects are added to the
> repository.

Jeff, first of all thanks for clarifying.

So it is the not-yet-packed objects that make packing with bitmaps less
efficient. I originally had in mind a freshly repacked repository
with a just-built bitmap index, and for that case extracting a pack with
the bitmap index seems to be just fine; but the more not-yet-packed objects
we have, the worse the result can be.

> As an aside, two other things that pack order matters for: it makes the
> bitmaps themselves compress better (because it increases locality of
> reachability, so you get nice runs of "1" or "0" bits).

Yes, I agree, and thanks for bringing this up: putting objects in recency
order in the pack also gives the bitmap index longer runs of identical
1 or 0 bits.

> It also makes
> the pack-reuse code more efficient (since in an ideal case, you can just
> dump a big block of data from the front of the pack). Note that the
> pack-reuse code that's in upstream git isn't that great; I have a better
> system on my big pile of patches to send upstream (that never seems to
> get smaller; <sigh>).

Yes, that also makes sense. I saw that write_reused_pack() in upstream git
just copies raw bytes from the original to the destination pack. You
mentioned you have something better for pack reuse in your patch queue; in
two words, does it now reuse the pack per object rather than by raw bytes,
or is it something else?

In other words, in what way does it work better? (I'm just curious here, as
it is interesting to know.)


> > >   - we don't learn about the filename of trees and blobs, which is going
> > >     to make the delta step much less efficient. This might be mitigated
> > >     by turning on the bitmap name-hash cache; I don't recall how much
> > >     detail pack-objects needs on the name (i.e., the full name versus
> > >     just the hash).
> > 
> > If I understand it right, it uses only uint32_t name hash while searching. From
> > pack-objects.{h,c} :
> 
> Yeah, I think you are right. Not having the real names is a problem for
> doing rev-list output, but I think pack-objects doesn't care (though do
> note that the name-hash cache is not enabled by default).

Yes, for packing only the hash is used. And I assume the name-hash cache
for bitmaps is not enabled by default for compatibility with JGit.

It would make sense to me to eventually enable the name-hash bitmap
extension by default, as the packing result is much better with it. And
those who care about compatibility with JGit can just turn it off in
their git config.

Just my thoughts.

> > > There may be other subtle things, too. The general idea of tying the
> > > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > > more important to generate a good pack.
> > 
> > It is better we send good packs to clients too, right? And with
> > pack.writeBitmapHashCache=true and retaining recency order (please see
> > above, but again maybe I'm not completely right) to me we should be still
> > generating a good pack while using bitmap reachability index for object
> > graph traversal.
> 
> We do want to send the client a good pack, but it's always a tradeoff.
> We could spend much more time searching for the perfect delta, but at
> some point we have to decide on how much CPU to spend serving them.
> Likewise, even if the bitmapped packs we send are in slightly worse
> order, saving a minute of CPU time off of every clone of the kernel is a
> big deal.

Yes, this I understand and agree with. As I said above, I was imagining a
freshly repacked repo with a recently rebuilt bitmap index, and for that
case we send a good pack with bitmaps out of the box.

> We also take robustness shortcuts when sending to clients. For example,
> when doing an on-disk repack we re-crc32 all of the delta data we are
> reusing, even if we don't actually inflate it (because we would want to
> stop immediately if we see even a single bit flipped on disk). But we
> don't check them when sending to a client, because we know they are
> going to actually `index-pack` it and get a stronger consistency check
> anyway, and don't want to waste server CPU.
> 
> The bitmaps are sort of the same. If there is a bug or corruption in the
> bitmap, the worst case is that we send a broken pack to the client, who
> will complain that we did not give them all of the objects. It's a
> momentary problem that can be fixed. If you use them for an on-disk
> repack, then the next step is usually to delete all of the old packs. So
> a corruption there carries forward, and is irreversible.

Thanks for clarifying. I did not know that pack-to-file is assumed to be
robust while pack-to-stdout is allowed to be less so. Or at
least I had not thought about it this way before.

> As I understand your use case, it is OK to do the less careful things.
> It's just that pack-objects until now has been split into two modes:
> packing to a file is careful, and packing to stdout is less so. And you
> want to pack to a file in the non-careful mode.

Yes, it should be ok, as after repository extraction git-backup
verifies rev-list for all refs:

    https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L855

And if an object is missing - e.g. a blob - rev-list complains:

    fatal: missing blob object '980a0d5f19a64b4b30a87d4206aade58726b60e3'

though it does not catch blob corruptions.

Since the worst that can happen when using the bitmap index (due to a bug
in the bitmap code or bitmap index corruption) is that not all objects are
extracted, this should be an effective measure to catch it.

The original whole-backup repository is also not removed, so we can
re-extract objects at any time.

So yes, using the bitmap reachability index for faster extraction from a
freshly repacked and bitmap-indexed backup repository should be ok and
makes sense to me.


> > > but I wonder if you should be using "pack-objects --stdout" yourself.
> > 
> > I already tried --stdout. The problem is on repository extraction we
> > need to both extract the pack and index it. While `pack-object file`
> > does both, for --stdout case we need to additionally index extracted
> > pack with `git index-pack`, and standalone `git index-pack` is very slow
> > - in my experience much slower than generating the pack itself:
> 
> Ah, right, that makes sense. The packfile does not carry the sha1 of the
> objects. A receiving index-pack has to compute them itself, including
> inflating and applying all of the deltas! By contrast, a pack to stdout
> can be quite quick, because in most cases it can avoid even inflating
> most of the data; where possible it just sends the zlib data straight
> from disk to the client.
> 
> So I do agree "--stdout" is not ideal for you (or at the very least, you
> really want pack-objects to generate the index from its internal table
> rather than having to reconstruct it just from the pack stream).

Yes, and thanks for clarifying a bit why standalone index-pack can be
slow.

> > > But even if it is the right thing for your use case to be using bitmaps
> > > to generate an on-disk bitmap, I think we should be making sure it
> > > _doesn't_ trigger when doing a normal repack.
> > 
> > So seems the way forward here is to teach pack-objects not to silently
> > drop explicit --use-bitmap-index for cases when it can handle it?
> > (currently even if this option was given, for !stdout cases pack-objects
> > simply drop use_bitmap_index to 0).
> > 
> > And to make sure default for use_bitmap_index is 0 for !stdout cases?
> 
> I think it would be reasonable to accept "--use-bitmap-index" on the
> command line as an override for "yes, really, this is what I want". So
> the logic would be something like:
> 
>   static int use_bitmap_index_default = 1;
>   static int use_bitmap_index = -1;
> 
>   ... parse config; if we see pack.usebitmaps, set
>       use_bitmap_index_default ...
> 
>   ... parse command line, setting use_bitmap_index ...
> 
>   /* "soft" reasons not to use bitmaps */
>   if (!pack_to_stdout)
> 	use_bitmap_index_default = 0;
> 
>   /* now install our default if the user didn't otherwise specify */
>   if (use_bitmap_index < 0)
> 	use_bitmap_index = use_bitmap_index_default;
> 
>   /* "hard" reasons not to use bitmaps; these just won't work at all */
>   if (!use_internal_rev_list || is_repository_shallow())
> 	use_bitmap_index = 0;


On Wed, Jul 13, 2016 at 04:30:44AM -0400, Jeff King wrote:
> On Tue, Jul 12, 2016 at 10:08:08PM +0300, Kirill Smelkov wrote:
> 
> > > Or are we fine with my arguments about recency order staying the same
> > > when using bitmap reachability index for object graph traversal, and this
> > > way the patch is fine to go in as it is?
> > 
> > Since there is no reply I assume the safe way to go is to let default
> > for pack-to-file case to be "not using bitmap index". Please find updated
> > patch and interdiff below. I would still be grateful for feedback on
> > my above use-bitmap-for-pack-to-file arguments.
> 
> Yeah, I think that is a reasonable approach. I see here you've added new
> config, though, and I don't think we want that.
> 
> For your purposes, where you're driving pack-objects individually, I
> think a command-line option makes more sense.

Yes, I was going to use --use-bitmap-index explicitly, but I thought that
since we already have pack.useBitmaps, for consistency it would be better to
introduce a config knob controlling the to-file case as well.


> If we did want to have a flag for "use bitmaps when repacking via
> repack", I think it should be "repack.useBitmaps", and git-repack should
> pass the command-line option to pack-objects. pack-objects is plumbing
> and should not really be reading config at all. You'll note that
> pack.writeBitmaps was a mistake and got deprecated in favor of
> repack.writeBitmaps. I think pack.useBitmaps is a mistake, too, but
> nobody has really noticed or cared because there's no good reason to set
> it (the more interesting question is: are there bitmaps available? and
> if so, we try to use them).

pack.useBitmaps is probably of no use in normal situations, but for
debugging bitmap-related problems it can be handy. Though someone
debugging can just as well adjust pack-objects.c. So should we
deprecate and eventually remove pack.useBitmaps?

Anyway, please find below the patch updated according to your suggestion.
I hope it is ok now.

Thanks again,
Kirill


(interdiff)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 8027951..4b14806 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2229,19 +2229,14 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
-pack.useBitmaps (deprecated)::
-	This is a deprecated synonym for `pack.useBitmaps.stdout`.
-
-pack.useBitmaps.stdout::
+pack.useBitmaps::
 	When true, git will use pack bitmaps (if available) when packing
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
-
-pack.useBitmaps.file::
-	When true, git will use pack bitmaps (if available) when packing
-	to file (e.g., on repack). Defaults to false. You should not
-	generally need to turn this on unless you know what you are doing.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7aaa1af..ffe8da6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,8 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_stdout = 1, use_bitmap_file = 0;
-static int use_bitmap_index;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2228,12 +2228,8 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
-	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
-		use_bitmap_stdout = git_config_bool(k, v);
-		return 0;
-	}
-	if (!strcmp(k, "pack.usebitmaps.file")) {
-		use_bitmap_file = git_config_bool(k, v);
+	if (!strcmp(k, "pack.usebitmaps")) {
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2710,7 +2706,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
-	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
 
@@ -2782,6 +2777,22 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
 	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH v3] pack-objects: Teach it to use reachability bitmap index when
 generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That, however, was done only for
the case when the resultant pack is sent to stdout, not written into a file.

We can teach pack-objects to use the bitmap index for the initial object
counting phase when generating a resultant pack file too:

- if we know bitmap index generation is not enabled for the resultant pack:

  Current code has a singleton bitmap_git, so it cannot work simultaneously
  with two bitmap indices.

- if we keep pack reuse enabled still only for the "send-to-stdout" case:

  Because on pack reuse raw entries are written directly to the destination
  pack by write_reused_pack(), bypassing the bookkeeping needed for pack
  index generation that the regular codepath does in write_one() and friends.

  (at least that's my understanding after briefly looking at the code)

We also need to teach add_object_entry_from_bitmap() to respect
--local by not adding nonlocal loose objects to the resultant pack (this
is the bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
to mean ignore non-local loose objects too)), so as not to break 'loose
objects in alternate ODB are not repacked' in t7700-repack.sh.

Otherwise all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be a 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repository restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff King suggested that it might not be generally a good idea to
use the bitmap reachability index when repacking a repository. The reason
is that for on-disk repack by default we want

- to produce a good pack (with a bitmap index, not-yet-packed objects are
  emitted to the pack in suboptimal order).

- to use the more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects to not use the bitmap
index: tools which need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly.

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107
    http://article.gmane.org/gmane.comp.version-control.git/299420

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
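For readers skimming the patch, the decision it makes in cmd_pack_objects()
can be restated as a small standalone function. This is only a sketch of the
logic described above, not code from the patch: the struct and the function
name resolve_use_bitmap_index() are made up for illustration, and the
"shallow" field merely stands in for the is_repository_shallow() call.

```c
#include <assert.h>

/* Hypothetical container; field names mirror the variables in the patch. */
struct bitmap_inputs {
	int cmdline;               /* -1: no option, 0: --no-use-bitmap-index, 1: --use-bitmap-index */
	int config_default;        /* value of pack.useBitmaps, 1 when unset */
	int pack_to_stdout;
	int use_internal_rev_list;
	int write_bitmap_index;
	int shallow;               /* stands in for is_repository_shallow() */
};

/* Returns 1 if the bitmap index would be used, 0 otherwise. */
int resolve_use_bitmap_index(const struct bitmap_inputs *in)
{
	int def = in->config_default;
	int use = in->cmdline;

	assert(use >= -1 && use <= 1);

	/* "soft" reason: packing to a file does not use bitmaps by default */
	if (!in->pack_to_stdout)
		def = 0;

	/* install the default only when the user did not say either way */
	if (use < 0)
		use = def;

	/* "hard" reasons: bitmaps cannot work here, even if requested */
	if (!in->use_internal_rev_list ||
	    (!in->pack_to_stdout && in->write_bitmap_index) ||
	    in->shallow)
		use = 0;

	return use;
}
```

With no option given and a to-file pack the result is 0; an explicit
--use-bitmap-index turns it into 1 unless one of the hard reasons applies.
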
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 28 ++++++++++++++++++++++++----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index db05dec..4b14806 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2234,6 +2234,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..ffe8da6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 {
 	uint32_t index_pos;
 
+	if (local && has_loose_object_nonlocal(sha1))
+		return 0;
+
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
@@ -2225,7 +2229,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2488,7 +2492,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2773,7 +2777,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..9fab2bb 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,20 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-13 10:52       ` Kirill Smelkov
@ 2016-07-17 17:06         ` Kirill Smelkov
  2016-07-19 11:29           ` Jeff King
  2016-07-25 18:40         ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-17 17:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Wed, Jul 13, 2016 at 01:52:16PM +0300, Kirill Smelkov wrote:
> On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote:
> > On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> > 
> > > >   - we will not compute the same write order (which is based on
> > > >     traversal order), leading to packs that have less efficient cache
> > > >     characteristics
> > > 
> > > I agree the order can be not exactly the same. Still if original pack is
> > > packed well (with good recency order), while using bitmap we will tend
> > > to traverse it in close to original order.
> > > 
> > > Maybe I'm not completely right on this, but to me it looks to be the
> > > case because if objects in original pack are put there linearly sorted
> > > by recency order, and we use bitmap index to set of all reachable
> > > objects from a root, and then just _linearly_ gather all those objects
> > > from original pack by 1s in bitmap and put them in the same order into
> > > destination pack, the recency order won't be broken.
> > > 
> > > Or am I maybe misunderstanding something?
> > 
> > Yeah, I think you can go some of the way by reusing the order from the
> > old pack. But keep in mind that the bitmap result may also contain
> > objects that are not yet packed. Those will just come in a big lump at
> > the end of the bitmap (these are the "extended entries" in the bitmap
> > code).
> > 
> > So I think if you were to repeatedly "git repack -adb" over time, you
> > would get worse and worse ordering as objects are added to the
> > repository.
> 
> Jeff, first of all thanks for clarifying.
> 
> So it is not-yet-packed-objects which make packing with bitmap less
> efficient. I was originally keeping in mind fresh repacked repository
> with just built bitmap index and for that case extracting pack with
> bitmap index seems to be just ok, but the more not-yet-packed objects we
> have the worse the result can be.
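
The ordering effect discussed above can be illustrated with a toy model. This
is not git's bitmap code: the function names and the flat byte-per-bit array
are assumptions made only for the illustration. Bit positions below n_packed
stand for objects in the order they sit in the bitmapped pack; positions at or
above n_packed stand for the "extended entries", i.e. objects that were not in
that pack yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the bits linearly and record every set position in 'out'.
 * Packed objects therefore keep their original pack order, and all
 * extended entries come out afterwards in one lump.
 */
size_t emit_in_bitmap_order(const uint8_t *bits, size_t n_bits, size_t *out)
{
	size_t i, n = 0;

	assert(bits != NULL && out != NULL);
	for (i = 0; i < n_bits; i++)
		if (bits[i])
			out[n++] = i;
	return n;
}

/* Count how many emitted positions are extended (not-yet-packed) entries. */
size_t count_extended(const size_t *out, size_t n, size_t n_packed)
{
	size_t i, extended = 0;

	for (i = 0; i < n; i++)
		if (out[i] >= n_packed)
			extended++;
	return extended;
}
```

In this model, the more objects that were added since the bitmap was written,
the larger the unordered tail becomes relative to the well-ordered front.
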
> 
> > As an aside, two other things that pack order matters for: it makes the
> > bitmaps themselves compress better (because it increases locality of
> > reachability, so you get nice runs of "1" or "0" bits).
> 
> Yes, I agree, and thanks for bringing this up: putting objects in recency
> order in the pack also gives the bitmap index longer runs of the same 1 or 0.
> 
> > It also makes
> > the pack-reuse code more efficient (since in an ideal case, you can just
> > dump a big block of data from the front of the pack). Note that the
> > pack-reuse code that's in upstream git isn't that great; I have a better
> > system on my big pile of patches to send upstream (that never seems to
> > get smaller; <sigh>).
> 
> Yes, it also makes sense. I saw write_reused_pack() in upstream git just
> copy raw bytes from original to destination pack. You mentioned you have
> something better for pack reuse - in your patch queue, in two words, is
> it now reusing pack based on object, not raw bytes, or is it something
> else?
> 
> In other words in which way it works better? (I'm just curious here as
> it is interesting to know)
> 
> 
> > > >   - we don't learn about the filename of trees and blobs, which is going
> > > >     to make the delta step much less efficient. This might be mitigated
> > > >     by turning on the bitmap name-hash cache; I don't recall how much
> > > >     detail pack-objects needs on the name (i.e., the full name versus
> > > >     just the hash).
> > > 
> > > If I understand it right, it uses only uint32_t name hash while searching. From
> > > pack-objects.{h,c} :
> > 
> > Yeah, I think you are right. Not having the real names is a problem for
> > doing rev-list output, but I think pack-objects doesn't care (though do
> > note that the name-hash cache is not enabled by default).
> 
> Yes, for packing it is only hash which is used. And I assume name-hash
> for bitmap is not enabled by default for compatibility with JGit code.
> 
> It would make sense to me to eventually enable name-hash bitmap
> extension by default, as packing result is much better with it. And
> those who care about compatibility with JGit can just turn it off in
> their git config.
> 
> Just my thoughts.
> 
> > > > There may be other subtle things, too. The general idea of tying the
> > > > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > > > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > > > more important to generate a good pack.
> > > 
> > > It is better we send good packs to clients too, right? And with
> > > pack.writeBitmapHashCache=true and retaining recency order (please see
> > > above, but again maybe I'm not completely right) to me we should be still
> > > generating a good pack while using bitmap reachability index for object
> > > graph traversal.
> > 
> > We do want to send the client a good pack, but it's always a tradeoff.
> > We could spend much more time searching for the perfect delta, but at
> > some point we have to decide on how much CPU to spend serving them.
> > Likewise, even if the bitmapped packs we send are in slightly worse
> > order, saving a minute of CPU time off of every clone of the kernel is a
> > big deal.
> 
> Yes, this I understand and agree. Like I said above I was imagining
> freshly repacked repo with recently rebuilt bitmap index and for that
> case we send a good pack with bitmaps out-of-the-box.
> 
> > We also take robustness shortcuts when sending to clients. For example,
> > when doing an on-disk repack we re-crc32 all of the delta data we are
> > reusing, even if we don't actually inflate it (because we would want to
> > stop immediately if we see even a single bit flipped on disk). But we
> > don't check them when sending to a client, because we know they are
> > going to actually `index-pack` it and get a stronger consistency check
> > anyway, and don't want to waste server CPU.
> > 
> > The bitmaps are sort of the same. If there is a bug or corruption in the
> > bitmap, the worst case is that we send a broken pack to the client, who
> > will complain that we did not give them all of the objects. It's a
> > momentary problem that can be fixed. If you use them for an on-disk
> > repack, then the next step is usually to delete all of the old packs. So
> > a corruption there carries forward, and is irreversible.
> 
> Thanks for clarifying here. I did not know pack-to-file is assumed to be
> robust and pack-to-stdout is allowed to be less so. Or at
> least I did not think about it this way before.
> 
> > As I understand your use case, it is OK to do the less careful things.
> > It's just that pack-objects until now has been split into two modes:
> > packing to a file is careful, and packing to stdout is less so. And you
> > want to pack to a file in the non-careful mode.
> 
> Yes, it should be ok, as after repository extraction git-backup
> verifies rev-list for all refs
> 
>     https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L855
> 
> And if an object is missing - e.g. a blob - rev-list complains:
> 
>     fatal: missing blob object '980a0d5f19a64b4b30a87d4206aade58726b60e3'
> 
> though it does not catch blob corruptions.
> 
> Since, when using the bitmap index, the worst that can happen (due to a
> bug in bitmap code or bitmap index corruption) is that not all objects are
> extracted, this should be an effective measure to catch it.
> 
> The original whole-backup repository is also not removed, so we can
> re-extract objects anytime.
> 
> So yes, using bitmap reachability index for faster extraction from
> freshly repacked and bitmap indexed backup repository should be ok and
> make sense to me.
> 
> 
> > > > but I wonder if you should be using "pack-objects --stdout" yourself.
> > > 
> > > I already tried --stdout. The problem is on repository extraction we
> > > need to both extract the pack and index it. While `pack-objects file`
> > > does both, for --stdout case we need to additionally index extracted
> > > pack with `git index-pack`, and standalone `git index-pack` is very slow
> > > - in my experience much slower than generating the pack itself:
> > 
> > Ah, right, that makes sense. The packfile does not carry the sha1 of the
> > objects. A receiving index-pack has to compute them itself, including
> > inflating and applying all of the deltas! By contrast, a pack to stdout
> > can be quite quick, because in most cases it can avoid even inflating
> > most of the data; where possible it just sends the zlib data straight
> > from disk to the client.
> > 
> > So I do agree "--stdout" is not ideal for you (or at the very least, you
> > really want pack-objects to generate the index from its internal table
> > rather than having to reconstruct it just from the pack stream).
> 
> Yes, and thanks for clarifying a bit why standalone index-pack can be
> slow.
> 
> > > > But even if it is the right thing for your use case to be using bitmaps
> > > > to generate an on-disk bitmap, I think we should be making sure it
> > > > _doesn't_ trigger when doing a normal repack.
> > > 
> > > So it seems the way forward here is to teach pack-objects not to silently
> > > drop an explicit --use-bitmap-index for cases when it can handle it?
> > > (currently even if this option was given, for !stdout cases pack-objects
> > > simply drops use_bitmap_index to 0).
> > > 
> > > And to make sure default for use_bitmap_index is 0 for !stdout cases?
> > 
> > I think it would be reasonable to accept "--use-bitmap-index" on the
> > command line as an override for "yes, really, this is what I want". So
> > the logic would be something like:
> > 
> >   static int use_bitmap_index_default = 1;
> >   static int use_bitmap_index = -1;
> > 
> >   ... parse config; if we see pack.usebitmaps, set
> >       use_bitmap_index_default ...
> > 
> >   ... parse command line, setting use_bitmap_index ...
> > 
> >   /* "soft" reasons not to use bitmaps */
> >   if (!pack_to_stdout)
> > 	use_bitmap_index_default = 0;
> > 
> >   /* now install our default if the user didn't otherwise specify */
> >   if (use_bitmap_index < 0)
> > 	use_bitmap_index = use_bitmap_index_default;
> > 
> >   /* "hard" reasons not to use bitmaps; these just won't work at all */
> >   if (!use_internal_rev_list || is_repository_shallow())
> > 	use_bitmap_index = 0;
> 
> 
> On Wed, Jul 13, 2016 at 04:30:44AM -0400, Jeff King wrote:
> > On Tue, Jul 12, 2016 at 10:08:08PM +0300, Kirill Smelkov wrote:
> > 
> > > > Or are we fine with my arguments about recency order staying the same
> > > > when using bitmap reachability index for object graph traversal, and this
> > > > way the patch is fine to go in as it is?
> > > 
> > > Since there is no reply I assume the safe way to go is to let default
> > > for pack-to-file case to be "not using bitmap index". Please find updated
> > > patch and interdiff below. I would still be grateful for feedback on
> > > my above use-bitmap-for-pack-to-file arguments.
> > 
> > Yeah, I think that is a reasonable approach. I see here you've added new
> > config, though, and I don't think we want that.
> > 
> > For your purposes, where you're driving pack-objects individually, I
> > think a command-line option makes more sense.
> 
> Yes, I was going to use --use-bitmap-index explicitly, but I thought
> since we already have pack.useBitmaps for consistency it is better to
> introduce controlling to-file config point.
> 
> 
> > If we did want to have a flag for "use bitmaps when repacking via
> > repack", I think it should be "repack.useBitmaps", and git-repack should
> > pass the command-line option to pack-objects. pack-objects is plumbing
> > and should not really be reading config at all. You'll note that
> > pack.writeBitmaps was a mistake and got deprecated in favor of
> > repack.writeBitmaps. I think pack.useBitmaps is a mistake, too, but
> > nobody has really noticed or cared because there's no good reason to set
> > it (the more interesting question is: are there bitmaps available? and
> > if so, we try to use them).
> 
> Probably pack.useBitmaps is of no use in normal situation, but for
> debugging problems related to bitmaps it can be handy. Though when
> someone debugs he/she can just adjust pack-objects.c . So should we
> deprecate and eventually remove pack.useBitmaps ?
> 
> Anyway, please find below updated patch according to your suggestion.
> Hope it is ok now.

Ping. Is the patch ok or something needs to be improved still?

Thanks beforehand for feedback,
Kirill


> (interdiff)
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 8027951..4b14806 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -2229,19 +2229,14 @@ pack.packSizeLimit::
>  	Common unit suffixes of 'k', 'm', or 'g' are
>  	supported.
>  
> -pack.useBitmaps (deprecated)::
> -	This is a deprecated synonym for `pack.useBitmaps.stdout`.
> -
> -pack.useBitmaps.stdout::
> +pack.useBitmaps::
>  	When true, git will use pack bitmaps (if available) when packing
>  	to stdout (e.g., during the server side of a fetch). Defaults to
>  	true. You should not generally need to turn this off unless
>  	you are debugging pack bitmaps.
> -
> -pack.useBitmaps.file::
> -	When true, git will use pack bitmaps (if available) when packing
> -	to file (e.g., on repack). Defaults to false. You should not
> -	generally need to turn this on unless you know what you are doing.
> ++
> +*NOTE*: when packing to file (e.g., on repack) the default is always not to use
> +	pack bitmaps.
>  
>  pack.writeBitmaps (deprecated)::
>  	This is a deprecated synonym for `repack.writeBitmaps`.
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 7aaa1af..ffe8da6 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -66,8 +66,8 @@ static struct packed_git *reuse_packfile;
>  static uint32_t reuse_packfile_objects;
>  static off_t reuse_packfile_offset;
>  
> -static int use_bitmap_stdout = 1, use_bitmap_file = 0;
> -static int use_bitmap_index;
> +static int use_bitmap_index_default = 1;
> +static int use_bitmap_index = -1;
>  static int write_bitmap_index;
>  static uint16_t write_bitmap_options;
>  
> @@ -2228,12 +2228,8 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  		else
>  			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
>  	}
> -	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
> -		use_bitmap_stdout = git_config_bool(k, v);
> -		return 0;
> -	}
> -	if (!strcmp(k, "pack.usebitmaps.file")) {
> -		use_bitmap_file = git_config_bool(k, v);
> +	if (!strcmp(k, "pack.usebitmaps")) {
> +		use_bitmap_index_default = git_config_bool(k, v);
>  		return 0;
>  	}
>  	if (!strcmp(k, "pack.threads")) {
> @@ -2710,7 +2706,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  
>  	reset_pack_idx_option(&pack_idx_opts);
>  	git_config(git_pack_config, NULL);
> -	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
>  	if (!pack_compression_seen && core_compression_seen)
>  		pack_compression_level = core_compression_level;
>  
> @@ -2782,6 +2777,22 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
>  		unpack_unreachable_expiration = 0;
>  
> +	/*
> +	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
> +	 *
> +	 * - to produce good pack (with bitmap index not-yet-packed objects are
> +	 *   packed in suboptimal order).
> +	 *
> +	 * - to use more robust pack-generation codepath (avoiding possible
> +	 *   bugs in bitmap code and possible bitmap index corruption).
> +	 */
> +	if (!pack_to_stdout)
> +		use_bitmap_index_default = 0;
> +
> +	if (use_bitmap_index < 0)
> +		use_bitmap_index = use_bitmap_index_default;
> +
> +	/* "hard" reasons not to use bitmaps; these just won't work at all */
>  	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
>  		use_bitmap_index = 0;
>  
> 
> ---- 8< ----
> From: Kirill Smelkov <kirr@nexedi.com>
> Subject: [PATCH v3] pack-objects: Teach it to use reachability bitmap index when
>  generating non-stdout pack too
> 
> Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
> if a repository has a bitmap index, pack-objects can nicely speed up the
> "Counting objects" graph traversal phase. That, however, was done only for
> the case when the resultant pack is sent to stdout, not written into a file.
>
> We can teach pack-objects to use the bitmap index for the initial object
> counting phase when generating a resultant pack file too:
>
> - if we know bitmap index generation is not enabled for the resultant pack:
>
>   Current code has a singleton bitmap_git, so it cannot work simultaneously
>   with two bitmap indices.
>
> - if we keep pack reuse enabled still only for the "send-to-stdout" case:
>
>   Because on pack reuse raw entries are written directly to the destination
>   pack by write_reused_pack(), bypassing the bookkeeping needed for pack
>   index generation that the regular codepath does in write_one() and friends.
>
>   (at least that's my understanding after briefly looking at the code)
>
> We also need to teach add_object_entry_from_bitmap() to respect
> --local by not adding nonlocal loose objects to the resultant pack (this
> is the bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
> to mean ignore non-local loose objects too)), so as not to break 'loose
> objects in alternate ODB are not repacked' in t7700-repack.sh.
> 
> Otherwise all git tests pass, and for pack-objects -> file we get nice
> speedup:
> 
>     erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
>     repository managed by git-backup[2] via
> 
>     time echo 0186ac99 | git pack-objects --revs erp5pack
> 
> before:  37.2s
> after:   26.2s
> 
> And for `git repack -adb` packed git.git
> 
>     time echo 5c589a73 | git pack-objects --revs gitpack
> 
> before:   7.1s
> after:    3.6s
> 
> i.e. it can be a 30% - 50% speedup for pack extraction.
>
> git-backup extracts many packs on repository restoration. That was my
> initial motivation for the patch.
> 
> [1] https://lab.nexedi.com/nexedi/erp5
> [2] https://lab.nexedi.com/kirr/git-backup
> 
> NOTE
> 
> Jeff King suggested that it might not be generally a good idea to
> use the bitmap reachability index when repacking a repository. The reason
> is that for on-disk repack by default we want
>
> - to produce a good pack (with a bitmap index, not-yet-packed objects are
>   emitted to the pack in suboptimal order).
>
> - to use the more robust pack-generation codepath (avoiding possible
>   bugs in bitmap code and possible bitmap index corruption).
>
> Jeff also suggests that pack.useBitmaps was probably a mistake to
> introduce originally. This way we are not adding another config point,
> but instead just always default to-file pack-objects to not use the bitmap
> index: tools which need to generate on-disk packs using bitmaps can
> pass --use-bitmap-index explicitly.
> 
> More context:
> 
>     http://article.gmane.org/gmane.comp.version-control.git/299063
>     http://article.gmane.org/gmane.comp.version-control.git/299107
>     http://article.gmane.org/gmane.comp.version-control.git/299420
> 
> Cc: Vicent Marti <tanoku@gmail.com>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
> ---
>  Documentation/config.txt |  3 +++
>  builtin/pack-objects.c   | 28 ++++++++++++++++++++++++----
>  t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
>  3 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index db05dec..4b14806 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -2234,6 +2234,9 @@ pack.useBitmaps::
>  	to stdout (e.g., during the server side of a fetch). Defaults to
>  	true. You should not generally need to turn this off unless
>  	you are debugging pack bitmaps.
> ++
> +*NOTE*: when packing to file (e.g., on repack) the default is always not to use
> +	pack bitmaps.
>  
>  pack.writeBitmaps (deprecated)::
>  	This is a deprecated synonym for `repack.writeBitmaps`.
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a2f8cfd..ffe8da6 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
>  static uint32_t reuse_packfile_objects;
>  static off_t reuse_packfile_offset;
>  
> -static int use_bitmap_index = 1;
> +static int use_bitmap_index_default = 1;
> +static int use_bitmap_index = -1;
>  static int write_bitmap_index;
>  static uint16_t write_bitmap_options;
>  
> @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>  {
>  	uint32_t index_pos;
>  
> +	if (local && has_loose_object_nonlocal(sha1))
> +		return 0;
> +
>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>  		return 0;
>  
> @@ -2225,7 +2229,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
>  	}
>  	if (!strcmp(k, "pack.usebitmaps")) {
> -		use_bitmap_index = git_config_bool(k, v);
> +		use_bitmap_index_default = git_config_bool(k, v);
>  		return 0;
>  	}
>  	if (!strcmp(k, "pack.threads")) {
> @@ -2488,7 +2492,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>  	if (prepare_bitmap_walk(revs) < 0)
>  		return -1;
>  
> -	if (pack_options_allow_reuse() &&
> +	if (pack_options_allow_reuse() && pack_to_stdout &&
>  	    !reuse_partial_packfile_from_bitmap(
>  			&reuse_packfile,
>  			&reuse_packfile_objects,
> @@ -2773,7 +2777,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
>  		unpack_unreachable_expiration = 0;
>  
> -	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
> +	/*
> +	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
> +	 *
> +	 * - to produce good pack (with bitmap index not-yet-packed objects are
> +	 *   packed in suboptimal order).
> +	 *
> +	 * - to use more robust pack-generation codepath (avoiding possible
> +	 *   bugs in bitmap code and possible bitmap index corruption).
> +	 */
> +	if (!pack_to_stdout)
> +		use_bitmap_index_default = 0;
> +
> +	if (use_bitmap_index < 0)
> +		use_bitmap_index = use_bitmap_index_default;
> +
> +	/* "hard" reasons not to use bitmaps; these just won't work at all */
> +	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
>  		use_bitmap_index = 0;
>  
>  	if (pack_to_stdout || !rev_list_all)
> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index 3893afd..9fab2bb 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -118,6 +118,20 @@ test_expect_success 'incremental repack can disable bitmaps' '
>  	git repack -d --no-write-bitmap-index
>  '
>  
> +test_expect_success 'pack-objects to file can use bitmap' '
> +	# make sure we still have 1 bitmap index from previous tests
> +	ls .git/objects/pack/ | grep bitmap >output &&
> +	test_line_count = 1 output &&
> +	# verify equivalent packs are generated with/without using bitmap index
> +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> +	git verify-pack -v packa-$packasha1.pack >packa.verify &&
> +	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
> +	grep -o "^$_x40" packa.verify |sort >packa.objects &&
> +	grep -o "^$_x40" packb.verify |sort >packb.objects &&
> +	test_cmp packa.objects packb.objects
> +'
> +
>  test_expect_success 'full repack, reusing previous bitmaps' '
>  	git repack -ad &&
>  	ls .git/objects/pack/ | grep bitmap >output &&
> -- 
> 2.9.0.431.g3cb5c84


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-17 17:06         ` Kirill Smelkov
@ 2016-07-19 11:29           ` Jeff King
  2016-07-19 12:14             ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-07-19 11:29 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote:

> > Anyway, please find below updated patch according to your suggestion.
> > Hope it is ok now.
> 
> Ping. Is the patch ok or something needs to be improved still?

Sorry, I'm traveling and haven't carefully reviewed it yet. It's still
on my list, but it may be a few days.

-Peff

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-19 11:29           ` Jeff King
@ 2016-07-19 12:14             ` Kirill Smelkov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-19 12:14 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Tue, Jul 19, 2016 at 05:29:07AM -0600, Jeff King wrote:
> On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote:
> 
> > > Anyway, please find below updated patch according to your suggestion.
> > > Hope it is ok now.
> > 
> > Ping. Is the patch ok or something needs to be improved still?
> 
> Sorry, I'm traveling and haven't carefully reviewed it yet. It's still
> on my list, but it may be a few days.

Jeff, thanks for the feedback. Have a good trip; it is good to know the
patch was not forgotten. I will wait until you are back.

Thanks again for feedback,
Kirill

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-13 10:52       ` Kirill Smelkov
  2016-07-17 17:06         ` Kirill Smelkov
@ 2016-07-25 18:40         ` Jeff King
  2016-07-25 18:53           ` Jeff King
  2016-07-27 20:15           ` Kirill Smelkov
  1 sibling, 2 replies; 62+ messages in thread
From: Jeff King @ 2016-07-25 18:40 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote:

> > So I think if you were to repeatedly "git repack -adb" over time, you
> > would get worse and worse ordering as objects are added to the
> > repository.
> 
> Jeff, first of all thanks for clarifying.
> 
> So it is not-yet-packed-objects which make packing with bitmap less
> efficient. I was originally keeping in mind fresh repacked repository
> with just built bitmap index and for that case extracting pack with
> bitmap index seems to be just ok, but the more not-yet-packed objects we
> have the worse the result can be.

Right. So I think your scheme is fine as long as you are doing your
regular "pack all into one" repacks with a real walk, and then
"branching" off of that with one-off bitmap-computed packs into files
(even if you later take a bunch of those files and pull them into a
single bitmapped pack, as long as that final "all into one" does the
walk).

Or I guess another way to think about it would be that if you're
computing bitmaps, you'd want to do the actual traversal.

> Yes, it also make sense. I saw write_reused_pack() in upstream git just
> copy raw bytes from original to destination pack. You mentioned you have
> something better for pack reuse - in your patch queue, in two words, is
> it now reusing pack based on object, not raw bytes, or is it something
> else?
> 
> In other words in which way it works better? (I'm just curious here as
> it is interesting to know)

The problem with the existing pack-reuse code is that it doesn't kick in
often enough. I think it looks to see that the client wants some
percentage of the pack (e.g., 90%), and then just sends the whole
beginning. This works especially badly if you have a bunch of related
repositories packed together (e.g., all of the forks of torvalds/linux
on GitHub), because you'll never hit 90% of that big pack; it has too
much unrelated cruft, even if most of the stuff you want _is_ at the
beginning. And "percent of pack" is not really a useful metric anyway.

So the better scheme is more like:

  1. Generate the bitmap of objects to send using reachability bitmaps.

  2. Do a quick scan of their content in the packfile to see which can
     be reused verbatim. If they're base objects, we can send them
     as-is. If they're deltas, we can send them if their base is going
     to be sent. This fills in another bitmap of "reusable" objects.

     After a long string of unusable objects, you can give up and set
     the rest of the bitmap to zeroes.

  3. Walk the "reuse" bitmap and send out the objects more-or-less
     verbatim. You do have to make adjustments to delta-base-offsets for
     any "holes" (so if an object's entry says "my base is 500 bytes
     back", but you omitted some objects in between, you have to adjust
     that offset).

The upside is that you can send out those objects without even making a
"struct object_entry" for them, which drastically reduces the memory
requirements for serving a clone. Any objects which didn't get marked
for reuse just get handled in the usual way (so stuff that was not close
by in the pack, or stuff that was pushed since your last big repack).

The downside is that because those objects aren't in our normal packing
list, they're not available as delta bases for the new objects we _do_
send. So it can make the resulting pack a little bit bigger.
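
The three steps above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the code from the patch
queue: struct pack_entry, mark_reusable(), layout_reused() and
adjusted_base_distance() are hypothetical names, a delta is reused only
when its base is itself reused, and the sketch ignores the pack header
and the fact that re-encoding an offset can change an entry's size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one object entry in a packfile. */
struct pack_entry {
	uint64_t offset;     /* byte offset in the source pack */
	uint64_t size;       /* bytes the entry occupies in the pack */
	int base;            /* index of the delta base, -1 for a base object */
	int wanted;          /* step 1: bit from the reachability bitmap */
	int reuse;           /* step 2: set by mark_reusable() */
	uint64_t new_offset; /* step 3: set by layout_reused() */
};

/*
 * Step 2: mark entries that can be copied verbatim. A base object is
 * reusable if it is wanted; a delta is reusable if it is wanted and its
 * base is reused too. After max_gap consecutive unusable entries we give
 * up and leave the rest of the "reuse" bits at zero.
 */
size_t mark_reusable(struct pack_entry *e, size_t n, size_t max_gap)
{
	size_t i, gap = 0, nr_reused = 0;

	for (i = 0; i < n; i++)
		e[i].reuse = 0;

	for (i = 0; i < n && gap < max_gap; i++) {
		int base = e[i].base;

		/* assumption: offset deltas always point backwards */
		assert(base < 0 || (size_t)base < i);

		if (e[i].wanted && (base < 0 || e[base].reuse)) {
			e[i].reuse = 1;
			nr_reused++;
			gap = 0;
		} else {
			gap++;
		}
	}
	return nr_reused;
}

/* Step 3a: assign output offsets to reused entries, closing the holes. */
uint64_t layout_reused(struct pack_entry *e, size_t n)
{
	uint64_t out = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!e[i].reuse)
			continue;
		e[i].new_offset = out;
		out += e[i].size;
	}
	return out; /* total bytes of reused entries */
}

/* Step 3b: "my base is N bytes back", recomputed for the output. */
uint64_t adjusted_base_distance(const struct pack_entry *e, size_t i)
{
	assert(e[i].reuse && e[i].base >= 0);
	return e[i].new_offset - e[e[i].base].new_offset;
}
```

With a 100-byte base at offset 0, an unwanted 400-byte object after it,
and a delta at offset 500, the delta's base distance shrinks from 500 to
100 once the hole is omitted.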

> Yes, for packing it is only hash which is used. And I assume name-hash
> for bitmap is not enabled by default for compatibility with JGit code.
>
> It would make sense to me to eventually enable name-hash bitmap
> extension by default, as packing result is much better with it. And
> those who care about compatibility with JGit can just turn it off in
> their git config.

Correct, the defaults are for JGit compatibility. If you are not using
JGit, you should have it on all the time. We went with the conservative
default, but as more people are using regular Git bitmaps, it would
probably be good to make them less arcane and confusing to use.

> > As I understand your use case, it is OK to do the less careful things.
> > It's just that pack-objects until now has been split into two modes:
> > packing to a file is careful, and packing to stdout is less so. And you
> > want to pack to a file in the non-careful mode.
> 
> Yes, it should be ok, as after repository extraction git-backup
> verifies rev-list for all refs
> 
>     https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L855
> 
> And if an object is missing - e.g. a blob - rev-list complains:
> 
>     fatal: missing blob object '980a0d5f19a64b4b30a87d4206aade58726b60e3'
> 
> though it does not catch blob corruptions.

Right, that makes sense. Even the pack-to-disk code invoked by
git-repack is not foolproof for blob corruptions. It is only checking a
crc, not the full sha1. So it's better than nothing, but not as careful
as a full index-pack.

> Probably pack.useBitmaps is of no use in normal situation, but for
> debugging problems related to bitmaps it can be handy. Though when
> someone debugs he/she can just adjust pack-objects.c . So should we
> deprecate and eventually remove pack.useBitmaps ?

In my opinion, yes. If we had any debugging option, it should be
something like "core.usebitmaps", to tell _all_ of git to pretend that
bitmaps don't exist (right now only pack-objects respects it, but we
could be using them to optimize more traversals).

> ---- 8< ----
> From: Kirill Smelkov <kirr@nexedi.com>
> Subject: [PATCH v3] pack-objects: Teach it to use reachability bitmap index when
>  generating non-stdout pack too
> 
> Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> if a repository has bitmap index, pack-objects can nicely speedup
> "Counting objects" graph traversal phase. That however was done only for
> case when resultant pack is sent to stdout, not written into a file.

I think we can give more motivation and context here with some of the
bits that have come out in our discussion. Like:

  The reason for this split is that pack-objects tries to determine how
  "careful" it should be based on whether we are packing to disk or to
  stdout. Packing to disk implies "git repack", and that we will likely
  delete the old packs after finishing. We want to be more careful (so
  as not to carry forward a corruption, and to generate a more optimal
  pack), and we presumably run less frequently and can afford extra CPU.
  Whereas packing to stdout implies serving a remote via "git fetch" or
  "git push". This happens more frequently (e.g., a server handling many
  fetching clients), and we assume the receiving end takes more
  responsibility for verifying the data.

  But this isn't always the case. One might want to generate on-disk
  packfiles for a specialized object transfer. Just using "--stdout" and
  writing to a file is not optimal, as it will not generate the matching
  pack index.

  So it would be useful to have some way of overriding this heuristic:
  to tell pack-objects that even though it should generate on-disk
  files, it is still OK to use the reachability bitmaps to do the
  traversal.

> We can teach pack-objects to use bitmap index for initial object
> counting phase when generating resultant pack file too:
> 
> - if we know bitmap index generation is not enabled for resultant pack:
> 
>   Current code has singleton bitmap_git so cannot work simultaneously
>   with two bitmap indices.

So one reason is that it is not currently possible with the
implementation. But I think it also gets to the above bit about
"optimal" packs. We do not want to generate bitmaps off of bitmaps,
because we lose information about the write order. That's probably worth
mentioning here.

> - if we keep pack reuse enabled still only for "send-to-stdout" case:
> 
>   Because on pack reuse raw entries are directly written out to destination
>   pack by write_reused_pack() bypassing needed for pack index generation
>   bookkeeping done by regular codepath in write_one() and friends.
> 
>   (at least that's my understanding after briefly looking at the code)

Yes, that's right. We definitely want pack-reuse off for this case.

> NOTE
> 
> Jeff King suggested that it might be not generally a good idea to
> use bitmap reachability index when repacking a repository. The reason
> here is for on-disk repack by default we want
> 
> - to produce good pack (with bitmap index not-yet-packed objects are
>   emitted to pack in suboptimal order).
> 
> - to use more robust pack-generation codepath (avoiding possible
>   bugs in bitmap code and possible bitmap index corruption).

Ah, this kind of covers the bits I talked about above. I think it makes
more sense to introduce them as part of the motivation, though, rather
than as a note here.

> Jeff also suggests that pack.useBitmaps was probably a mistake to
> introduce originally. This way we are not adding another config point,
> but instead just always default to-file pack-objects not to use bitmap
> index: Tools which need to generate on-disk packs with using bitmap, can
> pass --use-bitmap-index explicitly.

This part is important, though. Basically the reason we respect the
command-line option is that we know that git-repack would never set it
explicitly, so it is the hint that pack-objects can use to know which
case we are serving: a careful repack of our data, or just extraction of
some objects.

> @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>  {
>  	uint32_t index_pos;
>  
> +	if (local && has_loose_object_nonlocal(sha1))
> +		return 0;
> +
>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>  		return 0;

Hrm. Adding entries from the bitmap should ideally be very fast, but
here we're introducing extra lookups in the object database. I guess it
only kicks in when --local is given, though, which most bitmap-using
paths would not do.

But is this check enough? The non-bitmap code path calls
want_object_in_pack, which checks not only loose objects, but also
non-local packs, and .keep.

Those don't kick in for your use case. I wonder if we should simply have
something like:

  if (local || ignore_packed_keep)
	use_bitmap_index = 0;

and just skip bitmaps for those cases. That's easy to reason about, and
I don't think anybody would care (your use case does not, and the repack
use case is already not going to use bitmaps).

> @@ -2773,7 +2777,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
>  		unpack_unreachable_expiration = 0;
>  
> -	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
> +	/*
> +	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
> +	 *
> +	 * - to produce good pack (with bitmap index not-yet-packed objects are
> +	 *   packed in suboptimal order).
> +	 *
> +	 * - to use more robust pack-generation codepath (avoiding possible
> +	 *   bugs in bitmap code and possible bitmap index corruption).
> +	 */
> +	if (!pack_to_stdout)
> +		use_bitmap_index_default = 0;
> +
> +	if (use_bitmap_index < 0)
> +		use_bitmap_index = use_bitmap_index_default;
> +
> +	/* "hard" reasons not to use bitmaps; these just won't work at all */
> +	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
>  		use_bitmap_index = 0;

And that local/keep logic above would just become "hard" reasons
included here.
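
Put together, the decision could look like the sketch below. It is a
standalone illustration, not the patch itself: struct bitmap_opts and
resolve_use_bitmap_index() are hypothetical names that bundle the
globals used in cmd_pack_objects(), and -1 is assumed to mean "not given
on the command line", as in the hunk above.

```c
#include <assert.h>

/* Hypothetical bundle of the flags consulted in cmd_pack_objects(). */
struct bitmap_opts {
	int use_bitmap_index;         /* -1 = unset, 0/1 = explicit option */
	int use_bitmap_index_default; /* default when the option is unset */
	int pack_to_stdout;
	int use_internal_rev_list;
	int write_bitmap_index;
	int is_shallow;               /* stands in for is_repository_shallow() */
	int local;                    /* --local */
	int ignore_packed_keep;       /* --honor-pack-keep */
};

/* Returns 1 if the bitmap index should be used, 0 otherwise. */
int resolve_use_bitmap_index(const struct bitmap_opts *o)
{
	int use = o->use_bitmap_index;
	int def = o->use_bitmap_index_default;

	assert(use >= -1 && use <= 1);

	/* "soft" reason: on-disk packs default to the careful codepath */
	if (!o->pack_to_stdout)
		def = 0;
	if (use < 0)
		use = def;

	/* "hard" reasons: these just won't work at all */
	if (!o->use_internal_rev_list ||
	    (!o->pack_to_stdout && o->write_bitmap_index) ||
	    o->is_shallow)
		use = 0;

	/* the bitmap codepath does not check --local or .keep packs */
	if (o->local || o->ignore_packed_keep)
		use = 0;

	return use;
}
```

The point of this ordering is that an explicit --use-bitmap-index can
override only the soft default; the hard reasons still win over it.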

-Peff

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-25 18:40         ` Jeff King
@ 2016-07-25 18:53           ` Jeff King
  2016-07-27 20:15           ` Kirill Smelkov
  1 sibling, 0 replies; 62+ messages in thread
From: Jeff King @ 2016-07-25 18:53 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote:

> > @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> >  {
> >  	uint32_t index_pos;
> >  
> > +	if (local && has_loose_object_nonlocal(sha1))
> > +		return 0;
> > +
> >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> >  		return 0;
> 
> Hrm. Adding entries from the bitmap should ideally be very fast, but
> here we're introducing extra lookups in the object database. I guess it
> only kicks in when --local is given, though, which most bitmap-using
> paths would not do.
> 
> But is this check enough? The non-bitmap code path calls
> want_object_in_pack, which checks not only loose objects, but also
> non-local packs, and .keep.
> 
> Those don't kick in for your use case. I wonder if we should simply have
> something like:
> 
>   if (local || ignore_packed_keep)
> 	use_bitmap_index = 0;
> 
> and just skip bitmaps for those cases. That's easy to reason about, and
> I don't think anybody would care (your use case does not, and the repack
> use case is already not going to use bitmaps).

BTW, I thought we had more optimizations in this area, but I realized
that I had never sent them to the list. I just did, and you may want to
take a peek at:

  http://thread.gmane.org/gmane.comp.version-control.git/300218

I doubt it will speed up your case much (unless you really do have tons
of packs in your extraction). And I think it is still worth doing the
disabling I showed above, even with the optimizations, just because it's
easier to reason about.

So I _think_ those optimizations are orthogonal to what we're discussing
here, but I wanted to point you at them just in case.

-Peff

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-25 18:40         ` Jeff King
  2016-07-25 18:53           ` Jeff King
@ 2016-07-27 20:15           ` Kirill Smelkov
  2016-07-27 20:40             ` Junio C Hamano
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-27 20:15 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Vicent Marti

On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote:
> On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote:
> 
> > > So I think if you were to repeatedly "git repack -adb" over time, you
> > > would get worse and worse ordering as objects are added to the
> > > repository.
> > 
> > Jeff, first of all thanks for clarifying.
> > 
> > So it is not-yet-packed-objects which make packing with bitmap less
> > efficient. I was originally keeping in mind fresh repacked repository
> > with just built bitmap index and for that case extracting pack with
> > bitmap index seems to be just ok, but the more not-yet-packed objects we
> > have the worse the result can be.
> 
> Right. So I think your scheme is fine as long as you are doing your
> regular "pack all into one" repacks with a real walk, and then
> "branching" off of that with one-off bitmap-computed packs into files
> (even if you later take a bunch of those files and pull them into a
> single bitmapped pack, as long as that final "all into one" does the
> walk).
> 
> Or I guess another way to think about it would be that if you're
> computing bitmaps, you'd want to do the actual traversal.

Yes, exactly, and thanks for stating it clearly. We do our repacks and
recompute the bitmaps with a real walk, so as you say this should be
fine.


> > Yes, it also make sense. I saw write_reused_pack() in upstream git just
> > copy raw bytes from original to destination pack. You mentioned you have
> > something better for pack reuse - in your patch queue, in two words, is
> > it now reusing pack based on object, not raw bytes, or is it something
> > else?
> > 
> > In other words in which way it works better? (I'm just curious here as
> > it is interesting to know)
> 
> The problem with the existing pack-reuse code is that it doesn't kick in
> often enough. I think it looks to see that the client wants some
> percentage of the pack (e.g., 90%), and then just sends the whole
> beginning. This works especially badly if you have a bunch of related
> repositories packed together (e.g., all of the forks of torvalds/linux
> on GitHub), because you'll never hit 90% of that big pack; it has too
> much unrelated cruft, even if most of the stuff you want _is_ at the
> beginning. And "percent of pack" is not really a useful metric anyway.
> 
> So the better scheme is more like:
> 
>   1. Generate the bitmap of objects to send using reachability bitmaps.
> 
>   2. Do a quick scan of their content in the packfile to see which can
>      be reused verbatim. If they're base objects, we can send them
>      as-is. If they're deltas, we can send them if their base is going
>      to be sent. This fills in another bitmap of "reusable" objects.
> 
>      After a long string of unusable objects, you can give up and set
>      the rest of the bitmap to zeroes.
> 
>   3. Walk the "reuse" bitmap and send out the objects more-or-less
>      verbatim. You do have to make adjustments to delta-base-offsets for
>      any "holes" (so if an object's entry says "my base is 500 bytes
>      back", but you omitted some objects in between, you have to adjust
>      that offset).
> 
> The upside is that you can send out those objects without even making a
> "struct object_entry" for them, which drastically reduces the memory
> requirements for serving a clone. Any objects which didn't get marked
> for reuse just get handled in the usual way (so stuff that was not close
> by in the pack, or stuff that was pushed since your last big repack).

Thanks for clarifying. Yes, you are right: the current upstream code
checks whether >= 90% of the pack is what the destination wants, and
reuses the pack only in that case. (I had forgotten about it; because of
that >= 90% condition I had initially set reuse aside in my head as "not
applicable to git-backup".)

So the scheme you describe above can indeed be more efficient, and it
applies both to the torvalds/linux+forks case and to the git-backup case
(extracting packs from one big pack of all repos).

I'm looking forward to your patches on this topic. Please cc me on those
if you find it convenient.

> The downside is that because those objects aren't in our normal packing
> list, they're not available as delta bases for the new objects we _do_
> send. So it can make the resulting pack a little bit bigger.

So once again, the effect gets worse the more "new" objects we have
outside the original main pack, i.e. loose objects or objects living in
other, smaller packs. It drops to zero in the ideal case of a freshly
repacked repo with only one big pack.

Also: with more code, after sending a reused object we could in
principle still offer it as a delta base for the new objects. I mean
this should not be impossible in principle, or am I missing something?


> > Yes, for packing it is only hash which is used. And I assume name-hash
> > for bitmap is not enabled by default for compatibility with JGit code.
> >
> > It would make sense to me to eventually enable name-hash bitmap
> > extension by default, as packing result is much better with it. And
> > those who care about compatibility with JGit can just turn it off in
> > their git config.
> 
> Correct, the defaults are for JGit compatibility. If you are not using
> JGit, you should have it on all the time. We went with the conservative
> default, but as more people are using regular Git bitmaps, it would
> probably be good to make them less arcane and confusing to use.

I've just checked: JGit 3.7.1.201504261725-r (the version from Debian,
quite old) does _not_ barf on seeing bitmaps with a "name hash" section.
At least `jgit gc` does not exit with an error, as t5310-pack-bitmaps.sh
says it can:

    # jgit gc will barf if it does not like our bitmaps
    jgit gc


I will be sending another mail, with the relevant JGit people cc'ed,
proposing to turn pack.writeBitmapHashCache=true on by default.


> > ---- 8< ----
> > From: Kirill Smelkov <kirr@nexedi.com>
> > Subject: [PATCH v3] pack-objects: Teach it to use reachability bitmap index when
> >  generating non-stdout pack too
> > 
> > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > if a repository has bitmap index, pack-objects can nicely speedup
> > "Counting objects" graph traversal phase. That however was done only for
> > case when resultant pack is sent to stdout, not written into a file.
> 
> I think we can give more motivation and context here with some of the
> bits that have come out in our discussion. Like:
> 
>   The reason for this split is that pack-objects tries to determine how
>   "careful" it should be based on whether we are packing to disk or to
>   stdout. Packing to disk implies "git repack", and that we will likely
>   delete the old packs after finishing. We want to be more careful (so
>   as not to carry forward a corruption, and to generate a more optimal
>   pack), and we presumably run less frequently and can afford extra CPU.
>   Whereas packing to stdout implies serving a remote via "git fetch" or
>   "git push". This happens more frequently (e.g., a server handling many
>   fetching clients), and we assume the receiving end takes more
>   responsibility for verifying the data.
> 
>   But this isn't always the case. One might want to generate on-disk
>   packfiles for a specialized object transfer. Just using "--stdout" and
>   writing to a file is not optimal, as it will not generate the matching
>   pack index.
> 
>   So it would be useful to have some way of overriding this heuristic:
>   to tell pack-objects that even though it should generate on-disk
>   files, it is still OK to use the reachability bitmaps to do the
>   traversal.

Thanks, I'm adding this to the patch message in appropriate place.


> > We can teach pack-objects to use bitmap index for initial object
> > counting phase when generating resultant pack file too:
> > 
> > - if we know bitmap index generation is not enabled for resultant pack:
> > 
> >   Current code has singleton bitmap_git so cannot work simultaneously
> >   with two bitmap indices.
> 
> So one reason is that it is not currently possible with the
> implementation. But I think it also gets to the above bit about
> "optimal" packs. We do not want to generate bitmaps off of bitmaps,
> because we lose information about the write order. That's probably worth
> mentioning here.

Ok. I'm adding relevant note.


> > - if we keep pack reuse enabled still only for "send-to-stdout" case:
> > 
> >   Because on pack reuse raw entries are directly written out to destination
> >   pack by write_reused_pack() bypassing needed for pack index generation
> >   bookkeeping done by regular codepath in write_one() and friends.
> > 
> >   (at least that's my understanding after briefly looking at the code)
> 
> Yes, that's right. We definitely want pack-reuse off for this case.

Ok, thanks for clarifying.

> > NOTE
> > 
> > Jeff King suggested that it might be not generally a good idea to
> > use bitmap reachability index when repacking a repository. The reason
> > here is for on-disk repack by default we want
> > 
> > - to produce good pack (with bitmap index not-yet-packed objects are
> >   emitted to pack in suboptimal order).
> > 
> > - to use more robust pack-generation codepath (avoiding possible
> >   bugs in bitmap code and possible bitmap index corruption).
> 
> Ah, this kind of covers the bits I talked about above. I think it makes
> more sense to introduce them as part of the motivation, though, rather
> than as a note here.

Thanks, good idea (now that we have discussed the robustness issues and
started to take them into account). I'm moving this close to the head of
the description.


> > Jeff also suggests that pack.useBitmaps was probably a mistake to
> > introduce originally. This way we are not adding another config point,
> > but instead just always default to-file pack-objects not to use bitmap
> > index: Tools which need to generate on-disk packs with using bitmap, can
> > pass --use-bitmap-index explicitly.
> 
> This part is important, though. Basically the reason we respect the
> command-line option is that we know that git-repack would never set it
> explicitly, so it is the hint that pack-objects can use to know which
> case we are serving: a careful repack of our data, or just extraction of
> some objects.

Yes. To make this very clear I'm also adding an explicit note that
git-repack never passes --use-bitmap-index to pack-objects, so we can be
sure regular on-disk repacking remains robust.


> > @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> >  {
> >  	uint32_t index_pos;
> >  
> > +	if (local && has_loose_object_nonlocal(sha1))
> > +		return 0;
> > +
> >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> >  		return 0;
> 
> Hrm. Adding entries from the bitmap should ideally be very fast, but
> here we're introducing extra lookups in the object database. I guess it
> only kicks in when --local is given, though, which most bitmap-using
> paths would not do.
> 
> But is this check enough? The non-bitmap code path calls
> want_object_in_pack, which checks not only loose objects, but also
> non-local packs, and .keep.
> 
> Those don't kick in for your use case. I wonder if we should simply have
> something like:
> 
>   if (local || ignore_packed_keep)
> 	use_bitmap_index = 0;
> 
> and just skip bitmaps for those cases. That's easy to reason about, and
> I don't think anybody would care (your use case does not, and the repack
> use case is already not going to use bitmaps).

You are right - this is not enough. Initially I did not delve into this
--local case and only cared about making the tests pass (they were
failing without this check when the initial patch used
--use-bitmap-index by default).

I agree it is simpler to just not handle this case for now.

Actually, after thinking about it a bit more, I can see that even the
current code allows `git pack-objects --stdout` with --local or
--honor-pack-keep and does not handle those options properly. Thus I
suggest applying the following patch as the first one in what is now a
series:

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active under
 --local or --honor-pack-keep

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check at all whether --local
or --honor-pack-keep should be respected. In the non-bitmapped codepath
this is handled in want_object_in_pack(), but the bitmapped codepath
simply has no such checking.

The bitmapped codepath nevertheless accepted --local and
--honor-pack-keep, and bitmap indices were still used under such
conditions - potentially giving wrong output (including objects from
non-local or .keep'ed packs).

Instead of fixing the bitmapped codepath to respect those options, since
currently no one actually needs or uses them in combination with
bitmaps, let's just force use_bitmap_index=0 when either --local or
--honor-pack-keep is used, and add an appropriate comment about not
checking for those in add_object_entry_from_bitmap().

Suggested-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 15866d7..d7cf782 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	/*
+	 * for simplicity we always want object to be in pack, as
+	 * use_bitmap_index codepath assumes neither --local nor --honor-pack-keep
+	 * is active.
+	 */
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
@@ -2776,6 +2782,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
 		use_bitmap_index = 0;
 
+	/*
+	 * "lazy" reasons not to use bitmaps; it is easier to reason about when
+	 * neither --local nor --honor-pack-keep is in action, and so far no one
+	 * needed nor implemented such support yet.
+	 */
+	if (local || ignore_packed_keep)
+		use_bitmap_index = 0;
+
+
 	if (pack_to_stdout || !rev_list_all)
 		write_bitmap_index = 0;
 
-- 
2.9.0.431.g3cb5c84
---- 8< ----


On Mon, Jul 25, 2016 at 02:53:13PM -0400, Jeff King wrote:
> On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote:
> 
> > > @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> > >  {
> > >  	uint32_t index_pos;
> > >  
> > > +	if (local && has_loose_object_nonlocal(sha1))
> > > +		return 0;
> > > +
> > >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> > >  		return 0;
> > 
> > Hrm. Adding entries from the bitmap should ideally be very fast, but
> > here we're introducing extra lookups in the object database. I guess it
> > only kicks in when --local is given, though, which most bitmap-using
> > paths would not do.
> > 
> > But is this check enough? The non-bitmap code path calls
> > want_object_in_pack, which checks not only loose objects, but also
> > non-local packs, and .keep.
> > 
> > Those don't kick in for your use case. I wonder if we should simply have
> > something like:
> > 
> >   if (local || ignore_packed_keep)
> > 	use_bitmap_index = 0;
> > 
> > and just skip bitmaps for those cases. That's easy to reason about, and
> > I don't think anybody would care (your use case does not, and the repack
> > use case is already not going to use bitmaps).
> 
> BTW, I thought we had more optimizations in this area, but I realized
> that I had never sent them to the list. I just did, and you may want to
> take a peek at:
> 
>   http://thread.gmane.org/gmane.comp.version-control.git/300218
>
> I doubt it will speed up your case much (unless you really do have tons
> of packs in your extraction).
>
> And I think it is still worth doing disabling I showed above, even
> with the optimizations, just because it's easier to reason about.
> 
> So I _think_ those optimizations are orthogonal to what we're discussing
> here, but I wanted to point you at them just in case.

Thanks for the heads-up and for sending it. Yes, for git-backup we
usually restore from a freshly repacked repo, but the optimization is
useful in many other cases. After reading the patches I wonder why
the current state lasted for such a long time. I frequently have close to
50 packs in a repository, with only automatic gc triggering a full
repack, and for that case always looping through all 50 packs for
every object, even when we have already found the pack an object lives
in, is just a waste of time. And yes, on the client side I almost never
use an alternate object store and almost never do concurrent fetches etc.
(so if I understand correctly, no .keep files). Thanks for sending it.
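
As a toy illustration of that cost (hypothetical types and names, not
git's actual code or the code from the series above), a per-object scan
over packs with and without stopping at the first hit can be sketched
as:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a pack: just a list of object ids (hypothetical). */
struct toy_pack {
	const int *objects;
	size_t nr;
};

/* Linear membership test; a real pack would use its sorted index. */
static int toy_pack_has(const struct toy_pack *p, int id)
{
	size_t i;

	for (i = 0; i < p->nr; i++)
		if (p->objects[i] == id)
			return 1;
	return 0;
}

/*
 * Return the index of the first pack containing `id`, or -1 if none.
 * *probed receives the number of packs examined. When
 * stop_at_first_hit is 0, every pack is examined even after a hit.
 */
static int toy_find_pack(const struct toy_pack *packs, size_t nr_packs,
			 int id, int stop_at_first_hit, size_t *probed)
{
	int found = -1;
	size_t i;

	assert(probed != NULL);
	*probed = 0;
	for (i = 0; i < nr_packs; i++) {
		(*probed)++;
		if (found < 0 && toy_pack_has(&packs[i], id)) {
			found = (int)i;
			if (stop_at_first_hit)
				break;
		}
	}
	return found;
}
```

With 50 packs and an object living in the first one, the early-exit
variant examines 1 pack instead of 50; an object that is in no pack
still costs a full scan either way.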

( Btw, if we are talking about optimizations, here is something related
  to pack extractions, I think it is worth mentioning just in case:

  https://lab.nexedi.com/kirr/git-backup/blob/ad6c6853/NOTES.restore

  It is a scheme for computing a "non-overlapping" set of packs when
  restoring repositories from a big backup repo, so that neither disk
  space (same objects in many packs) nor time (computing packs with many
  of the same objects many times) is wasted. Packs shared between
  repositories are then just hardlinked to the appropriate places.

  It is in line with e.g.

  https://git.kernel.org/cgit/git/git.git/commit/tree-diff.c?id=72441af7

  because it is an algorithmic optimization, though I do not have
  working code implementing it yet. )


Anyway, the updated main patch goes below:

(whole-patch interdiff)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1ef85a6..f8b173d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1053,12 +1053,15 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 {
        uint32_t index_pos;
 
-       if (local && has_loose_object_nonlocal(sha1))
-               return 0;
-
        if (have_duplicate_entry(sha1, 0, &index_pos))
                return 0;
 
+       /*
+        * for simplicity we always want object to be in pack, as
+        * use_bitmap_index path assumes neither --local nor --honor-pack-keep
+        * is active.
+        */
+
        create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
        display_progress(progress_state, nr_result);
@@ -2796,6 +2799,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
        if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
                use_bitmap_index = 0;
 
+       /*
+        * "lazy" reasons not to use bitmaps; it is easier to reason about when
+        * neither --local nor --honor-pack-keep is in action, and so far no one
+        * needed nor implemented such support yet.
+        */
+       if (local || ignore_packed_keep)
+               use_bitmap_index = 0;
+
+
        if (pack_to_stdout || !rev_list_all)
                write_bitmap_index = 0;
 

(log interdiff)
@@ -5,29 +5,69 @@ if a repository has bitmap index, pack-objects can nicely speedup
 "Counting objects" graph traversal phase. That however was done only for
 case when resultant pack is sent to stdout, not written into a file.
 
-We can teach pack-objects to use bitmap index for initial object
+The reason here is for on-disk repack by default we want:
+
+- to produce good pack (with bitmap index not-yet-packed objects are
+  emitted to pack in suboptimal order).
+
+- to use more robust pack-generation codepath (avoiding possible
+  bugs in bitmap code and possible bitmap index corruption).
+
+Jeff King further explains:
+
+    The reason for this split is that pack-objects tries to determine how
+    "careful" it should be based on whether we are packing to disk or to
+    stdout. Packing to disk implies "git repack", and that we will likely
+    delete the old packs after finishing. We want to be more careful (so
+    as not to carry forward a corruption, and to generate a more optimal
+    pack), and we presumably run less frequently and can afford extra CPU.
+    Whereas packing to stdout implies serving a remote via "git fetch" or
+    "git push". This happens more frequently (e.g., a server handling many
+    fetching clients), and we assume the receiving end takes more
+    responsibility for verifying the data.
+
+    But this isn't always the case. One might want to generate on-disk
+    packfiles for a specialized object transfer. Just using "--stdout" and
+    writing to a file is not optimal, as it will not generate the matching
+    pack index.
+
+    So it would be useful to have some way of overriding this heuristic:
+    to tell pack-objects that even though it should generate on-disk
+    files, it is still OK to use the reachability bitmaps to do the
+    traversal.
+
+So we can teach pack-objects to use bitmap index for initial object
 counting phase when generating resultant pack file too:
 
+- if we care it is not activated under git-repack:
+
+  See above about repack robustness and not forward-carrying corruption.
+
 - if we know bitmap index generation is not enabled for resultant pack:
 
   Current code has singleton bitmap_git so cannot work simultaneously
   with two bitmap indices.
 
+  We also want to avoid (at least with current implementation)
+  generating bitmaps off of bitmaps. The reason here is: when generating
+  a pack, not-yet-packed objects will be emitted into pack in
+  suboptimal order and added to tail of the bitmap as "extended entries".
+  When the resultant pack + some new objects in associated repository
+  are in turn used to generate another pack with bitmap, the situation
+  repeats: new objects are again not emitted optimally and just added to
+  bitmap tail - not in recency order.
+
+  So the pack badness can grow over time when at each step we have
+  bitmapped pack + some other objects. That's why we want to avoid
+  generating bitmaps off of bitmaps, not to let pack badness grow.
+
 - if we keep pack reuse enabled still only for "send-to-stdout" case:
 
   Because on pack reuse raw entries are directly written out to destination
   pack by write_reused_pack() bypassing needed for pack index generation
   bookkeeping done by regular codepath in write_one() and friends.
 
-  (at least that's my understanding after briefly looking at the code)
-
-We also need to care and teach add_object_entry_from_bitmap() to respect
---local via not adding nonlocal loose object to resultant pack (this
-is bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
-to mean ignore non-local loose objects too) -- not to break 'loose
-objects in alternate ODB are not repacked' in t7700-repack.sh .
-
-Otherwise all git tests pass, and for pack-objects -> file we get nice
+This way all git tests pass, and for pack-objects -> file we get nice
 speedup:
 
     erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
@@ -55,27 +95,62 @@ initial motivation for the patch.
 
 NOTE
 
-Jeff King suggested that it might be not generally a good idea to
-use bitmap reachability index when repacking a repository. The reason
-here is for on-disk repack by default we want
-
-- to produce good pack (with bitmap index not-yet-packed objects are
-  emitted to pack in suboptimal order).
-
-- to use more robust pack-generation codepath (avoiding possible
-  bugs in bitmap code and possible bitmap index corruption).
-
 Jeff also suggests that pack.useBitmaps was probably a mistake to
 introduce originally. This way we are not adding another config point,
 but instead just always default to-file pack-objects not to use bitmap
 index: Tools which need to generate on-disk packs with using bitmap, can
-pass --use-bitmap-index explicitly.
+pass --use-bitmap-index explicitly. And git-repack does never pass
+--use-bitmap-index, so this way we can be sure regular on-disk repacking
+remains robust.
+
+NOTE2
+
+`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
+than `git pack-objects file.pack`. Extracting erp5.git pack from
+lab.nexedi.com backup repository:
+
+---- 8< ----
+$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
+
+real    0m22.309s
+user    0m21.148s
+sys     0m0.932s
+
+$ time git index-pack erp5pack-stdout.pack
+
+real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
+user    0m49.300s
+sys     0m1.360s
+---- 8< ----
+
+So the time for
+
+    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,
+
+while
+
+    `pack-objects file.pack` which does both pack and index     is  27s.
+
+And even
+
+    `pack-objects --no-use-bitmap-index file.pack`              is  37s.
+
+Jeff explains:
+
+    The packfile does not carry the sha1 of the objects. A receiving
+    index-pack has to compute them itself, including inflating and applying
+    all of the deltas.
+
+that's why for `git-backup restore` we want to teach `git pack-objects
+file.pack` to use bitmaps instead of using `git pack-objects --stdout
+>file.pack` + `git index-pack file.pack`.
 
 More context:
 
     http://article.gmane.org/gmane.comp.version-control.git/299063
     http://article.gmane.org/gmane.comp.version-control.git/299107
     http://article.gmane.org/gmane.comp.version-control.git/299420
+    http://article.gmane.org/gmane.comp.version-control.git/300217
 
 Cc: Vicent Marti <tanoku@gmail.com>
 Helped-by: Jeff King <peff@peff.net>


(patch itself)
---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH 2/2] pack-objects: Teach it to use reachability bitmap index when
 generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

The reason here is for on-disk repack by default we want:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we care it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

This way all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects not to use bitmap
index: Tools which need to generate on-disk packs with using bitmap, can
pass --use-bitmap-index explicitly. And git-repack does never pass
--use-bitmap-index, so this way we can be sure regular on-disk repacking
remains robust.

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

---- 8< ----
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

real    0m22.309s
user    0m21.148s
sys     0m0.932s

$ time git index-pack erp5pack-stdout.pack

real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
user    0m49.300s
sys     0m1.360s
---- 8< ----

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107
    http://article.gmane.org/gmane.comp.version-control.git/299420
    http://article.gmane.org/gmane.comp.version-control.git/300217

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 25 +++++++++++++++++++++----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index b0ed71f..39ab41d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2244,6 +2244,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d7cf782..f8b173d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
 
@@ -2231,7 +2232,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2494,7 +2495,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2779,7 +2780,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	/*
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 0d03583..0802b7c 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -117,6 +117,20 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84
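
P.S. Taken together with patch 1/2, the way use_bitmap_index ends up
being decided can be sketched as a standalone function. This is only an
illustration under stated assumptions: resolve_use_bitmap_index() and
struct bitmap_opts are hypothetical names that do not exist in git, the
fields merely mirror the variables in builtin/pack-objects.c, and
is_shallow stands in for the result of is_repository_shallow().

```c
#include <assert.h>

/* Hypothetical bundle of the inputs; not a structure from git itself. */
struct bitmap_opts {
	int use_bitmap_index;         /* -1: not given, 0/1: --[no-]use-bitmap-index */
	int use_bitmap_index_default; /* pack.useBitmaps (1 unless configured) */
	int pack_to_stdout;
	int write_bitmap_index;
	int use_internal_rev_list;
	int is_shallow;               /* stand-in for is_repository_shallow() */
	int local;                    /* --local */
	int ignore_packed_keep;       /* --honor-pack-keep */
};

/* Returns 1 if the bitmap index would be used, 0 otherwise. */
static int resolve_use_bitmap_index(struct bitmap_opts o)
{
	int use = o.use_bitmap_index;

	assert(use >= -1 && use <= 1);

	/* "soft" reason: packing to a file defaults to not using bitmaps */
	if (!o.pack_to_stdout)
		o.use_bitmap_index_default = 0;

	/* an explicit command-line choice wins over the default */
	if (use < 0)
		use = o.use_bitmap_index_default;

	/* "hard" reasons: these combinations just won't work at all */
	if (!o.use_internal_rev_list ||
	    (!o.pack_to_stdout && o.write_bitmap_index) ||
	    o.is_shallow)
		use = 0;

	/* patch 1/2: bitmaps are off under --local or --honor-pack-keep */
	if (o.local || o.ignore_packed_keep)
		use = 0;

	return use;
}
```

So a to-file pack-objects uses bitmaps only when --use-bitmap-index is
passed explicitly and none of the hard reasons applies, while the
to-stdout case keeps following pack.useBitmaps.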

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-27 20:15           ` Kirill Smelkov
@ 2016-07-27 20:40             ` Junio C Hamano
  2016-07-28 20:22               ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-07-27 20:40 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Jérome Perrin, Isabelle Vallet, Kazuhiko Shiozaki,
	Julien Muchembled, git, Vicent Marti

Kirill Smelkov <kirr@nexedi.com> writes:

> From: Kirill Smelkov <kirr@nexedi.com>
> Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active under
>  --local or --honor-pack-keep
>
> Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> are two codepaths in pack-objects: with & without using bitmap
> reachability index.
>
> However add_object_entry_from_bitmap(), despite its non-bitmapped
> counterpart add_object_entry(), in no way does check for whether --local
> or --honor-pack-keep should be respected. In non-bitmapped codepath this
> is handled in want_object_in_pack(), but bitmapped codepath has simply
> no such checking at all.
>
> The bitmapped codepath however was allowing to pass --local and
> --honor-pack-keep and bitmap indices were still used under such
> conditions - potentially giving wrong output (including objects from
> non-local or .keep'ed pack).
>
> Instead of fixing bitmapped codepath to respect those options, since
> currently no one actually need or use them in combination with bitmaps,
> let's just force use_bitmap_index=0 when any of --local or
> --honor-pack-keep are used and add appropriate comment about
> not-checking for those in add_object_entry_from_bitmap()
>
> Suggested-by: Jeff King <peff@peff.net>
> ---
>  builtin/pack-objects.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 15866d7..d7cf782 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>  		return 0;
>  
> +	/*
> +	 * for simplicity we always want object to be in pack, as
> +	 * use_bitmap_index codepath assumes neither --local nor --honor-pack-keep
> +	 * is active.
> +	 */

I am not sure this comment is useful to readers.

Unless the readers are comparing add_object_entry() and this
function and wondering why this side lacks a check here, iow, when
they are merely following from a caller of this function through
this function down to its callee to understand what goes on, this
comment would not help them and only confuse them.

If we were to say something to help those who are comparing these
two functions, I think we should be more explicit, i.e.

    The caller disables use-bitmap-index when --local or
    --honor-pack-keep options are in effect because bitmap code is
    not prepared to handle them.  Because the control does not reach
    here if these options are in effect, the check with
    want_object_in_pack() to skip objects is not done.

or something like that.

Or is the rest of the bitmap codepath prepared to handle these
options and it is just the matter of adding the missing check with
want_object_in_pack() here to make it work correctly?

>  	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
>  
>  	display_progress(progress_state, nr_result);
> @@ -2776,6 +2782,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
>  		use_bitmap_index = 0;
>  
> +	/*
> +	 * "lazy" reasons not to use bitmaps; it is easier to reason about when
> +	 * neither --local nor --honor-pack-keep is in action, and so far no one
> +	 * needed nor implemented such support yet.
> +	 */

Justifying comment like this is a good idea, but the comment above
does not make it very clear that this is a correctness fix, i.e. if
we do not disable, the code will do a wrong thing.

The other logic to disable use of bitmap we can see in the
pre-context would also benefit from some description as to why;
6b8fda2d (pack-objects: use bitmaps when packing objects,
2013-12-21) didn't do a very good job in that---the reason is not
clear in its log message, either.

> +	if (local || ignore_packed_keep)
> +		use_bitmap_index = 0;
> +
> +

I see one extra blank line here ;-)

>  	if (pack_to_stdout || !rev_list_all)
>  		write_bitmap_index = 0;

Thanks.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-27 20:40             ` Junio C Hamano
@ 2016-07-28 20:22               ` Kirill Smelkov
  2016-07-28 21:18                 ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-28 20:22 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Jérome Perrin, Isabelle Vallet, Kazuhiko Shiozaki,
	Julien Muchembled, git, Vicent Marti

Junio, first of all, thanks for the feedback.

On Wed, Jul 27, 2016 at 01:40:36PM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > From: Kirill Smelkov <kirr@nexedi.com>
> > Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active under
> >  --local or --honor-pack-keep
> >
> > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> > are two codepaths in pack-objects: with & without using bitmap
> > reachability index.
> >
> > However add_object_entry_from_bitmap(), despite its non-bitmapped
> > counterpart add_object_entry(), in no way does check for whether --local
> > or --honor-pack-keep should be respected. In non-bitmapped codepath this
> > is handled in want_object_in_pack(), but bitmapped codepath has simply
> > no such checking at all.
> >
> > The bitmapped codepath however was allowing to pass --local and
> > --honor-pack-keep and bitmap indices were still used under such
> > conditions - potentially giving wrong output (including objects from
> > non-local or .keep'ed pack).
> >
> > Instead of fixing bitmapped codepath to respect those options, since
> > currently no one actually need or use them in combination with bitmaps,
> > let's just force use_bitmap_index=0 when any of --local or
> > --honor-pack-keep are used and add appropriate comment about
> > not-checking for those in add_object_entry_from_bitmap()
> >
> > Suggested-by: Jeff King <peff@peff.net>
> > ---
> >  builtin/pack-objects.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 15866d7..d7cf782 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> >  		return 0;
> >  
> > +	/*
> > +	 * for simplicity we always want object to be in pack, as
> > +	 * use_bitmap_index codepath assumes neither --local nor --honor-pack-keep
> > +	 * is active.
> > +	 */
> 
> I am not sure this comment is useful to readers.
> 
> Unless the readers are comparing add_object_entry() and this
> function and wondering why this side lacks a check here, iow, when
> they are merely following from a caller of this function through
> this function down to its callee to understand what goes on, this
> comment would not help them and only confuse them.
> 
> If we were to say something to help those who are comparing these
> two functions, I think we should be more explicit, i.e.
> 
>     The caller disables use-bitmap-index when --local or
>     --honor-pack-keep options are in effect because bitmap code is
>     not prepared to handle them.  Because the control does not reach
>     here if these options are in effect, the check with
>     want_object_in_pack() to skip objects is not done.
> 
> or something like that.

You are probably right.


> Or is the rest of the bitmap codepath prepared to handle these
> options and it is just the matter of adding the missing check with
> want_object_in_pack() here to make it work correctly?

I have been waiting so long for the main patch to be at least queued to
pu that I'm now a bit frustrated and ready to do something not related
to the main goal :) (they say every joke contains part of a joke). Here
is something from sleepy me:

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Wed, 27 Jul 2016 22:18:04 +0300
Subject: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to
 respect --local, --honor-pack-keep and --incremental

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath simply has no such checking at all.

However, the bitmapped codepath accepted all of those options and still
used bitmap indices under such conditions, potentially giving wrong
output (e.g. including objects from a non-local or .keep'ed pack).

We can easily fix this by noting the following: an object can arrive at
add_object_entry_from_bitmap() from one of two sources:

    1. entries coming from the main pack covered by the bitmap index, and
    2. objects coming from loose storage or other packs, possibly
       alternate ones.

For "2" the pack has not yet been found by the bitmap traversal code, and
thus we can simply reuse the non-bitmapped want_object_in_pack() to find
which pack an object lives in and to decide whether to omit it.

For "1" the pack has already been found by the bitmap traversal code and
we only need to check that pack against the same omission criteria that
want_object_in_pack() applies to found_pack.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Discussed-with: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c  |  39 +++++++++++++++++++
 t/t5310-pack-bitmaps.sh | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..34b3019 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -987,6 +987,42 @@ static int want_object_in_pack(const unsigned char *sha1,
 	return 1;
 }
 
+/* Like want_object_in_pack() but for objects coming from-under bitmapped traversal */
+static int want_object_in_pack_bitmap(const unsigned char *sha1,
+				      struct packed_git **found_pack,
+				      off_t *found_offset)
+{
+	struct packed_git *p = *found_pack;
+
+	/*
+	 * There are two types of requests coming here:
+	 * 1. entries coming from main pack covered by bitmap index, and
+	 * 2. object coming from, possibly alternate, loose or other packs.
+	 *
+	 * For "1" we always have *found_pack != NULL passed here from
+	 * traverse_bitmap_commit_list(). (*found_pack is bitmap_git.pack
+	 * actually).
+	 *
+	 * For "2" we always have *found_pack == NULL passed here from
+	 * traverse_bitmap_commit_list() - since this is the way bitmap
+	 * traversal passes here "extended" bitmap entries.
+	 */
+
+	/* objects not covered by bitmap */
+	if (!p)
+		return want_object_in_pack(sha1, 0, found_pack, found_offset);
+
+	/* objects covered by bitmap - we only have to check p wrt local and .keep */
+	if (incremental)
+		return 0;
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	return 1;
+}
+
 static void create_object_entry(const unsigned char *sha1,
 				enum object_type type,
 				uint32_t hash,
@@ -1055,6 +1091,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack_bitmap(sha1, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..a76f6ca 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,88 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	mkdir -p alt_objects/pack &&
+	echo $(pwd)/alt_objects > .git/objects/info/alternates &&
+	echo content1 > file1 &&
+	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	git verify-pack -v 1.pack >1.objects &&
+	if egrep "^$objsha1" 1.objects; then
+		echo "Non-local object present in pack generated with --local: $objsha1"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 > file2 &&
+	objsha2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	pack2=$(echo $objsha2 | \
+		git pack-objects pack2) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	touch .git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $objsha2) &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	git verify-pack -v 2a.pack >2a.objects &&
+	if egrep "^$objsha2" 2a.objects; then
+		echo "Object from .keeped pack present in pack generated with --honor-pack-keep: $objsha2"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	git verify-pack -v 2b.pack >2b.objects &&
+	if egrep "^$objsha2" 2b.objects; then
+		echo "Non-local object present in pack generated with --local: $objsha2"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	git verify-pack -v .git/objects/pack/$packbitmap.pack >packbitmap.verify &&
+	grep -o "^$_x40" packbitmap.verify |sort >packbitmap.objects &&
+	touch .git/objects/pack/$packbitmap.keep &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	git verify-pack -v 3a.pack >3a.objects &&
+	if grep -qFf packbitmap.objects 3a.objects; then
+		echo "Object from .keeped bitmapped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi &&
+	rm .git/objects/pack/$packbitmap.keep
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	git verify-pack -v 3b.pack >3b.objects &&
+	if grep -qFf packbitmap.objects 3b.objects; then
+		echo "Non-local object from bitmapped pack present in pack generated with --local"
+		return 1
+	fi &&
+	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +225,24 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	git verify-pack -v 4.pack >4.verify &&
+	grep -o "^$_x40" 4.verify |sort >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	grep -o "^$_x40" revlist |sort >objects &&
+	if grep -qvFf objects 4.objects; then
+		echo "Expected objects not present in incremental pack"
+		return 1
+	fi
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.0.431.g3cb5c84
---- 8< ----


and main patch updated to avoid trivial conflicts

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 7 Jul 2016 20:12:00 +0300
Subject: [PATCH 2/2] pack-objects: Teach it to use reachability bitmap index
 when generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

The reason is that for an on-disk repack we want, by default:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we take care that it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled only for the "send-to-stdout" case:

  Because on pack reuse, raw entries are written directly to the
  destination pack by write_reused_pack(), bypassing the bookkeeping
  needed for pack index generation that the regular codepath does in
  write_one() and friends.

This way all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be a 30% - 50% speedup for pack extraction.
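
For reference, the percentages follow from the timings above as
(before - after) / before. A minimal sketch of that arithmetic;
`percent_saved` is a hypothetical helper name used only for this
illustration:

```c
#include <assert.h>

/* Percentage of wall-clock time saved going from `before` to `after` seconds. */
static double percent_saved(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

percent_saved(37.2, 26.2) gives roughly 29.6 for erp5.git and
percent_saved(7.1, 3.6) gives roughly 49.3 for git.git.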

git-backup extracts many packs when restoring repositories. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. So we are not adding another config point;
instead, to-file pack-objects always defaults to not using the bitmap
index. Tools which need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so we can be sure regular on-disk repacking
remains robust.
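
The resulting precedence can be sketched as a standalone function. This
is only a model under stated assumptions: `struct bitmap_inputs` and
`resolve_use_bitmap_index` are hypothetical names, and the command-line
flag is assumed to be stored as -1 (not given), 0 or 1.

```c
#include <assert.h>

/* Hypothetical inputs; field names mirror the variables discussed here. */
struct bitmap_inputs {
	int cli_choice;		/* -1: not given, 0: --no-use-bitmap-index, 1: --use-bitmap-index */
	int config_default;	/* pack.useBitmaps, 1 when unset */
	int pack_to_stdout;
	int write_bitmap_index;
	int use_internal_rev_list;
	int repository_is_shallow;
};

static int resolve_use_bitmap_index(const struct bitmap_inputs *in)
{
	int def = in->config_default;
	int use;

	assert(in->cli_choice >= -1 && in->cli_choice <= 1);

	/* "soft" reason: packing to a file defaults to no bitmaps */
	if (!in->pack_to_stdout)
		def = 0;

	/* an explicit command-line choice overrides the default */
	use = in->cli_choice < 0 ? def : in->cli_choice;

	/* "hard" reasons: combinations that cannot work at all */
	if (!in->use_internal_rev_list ||
	    (!in->pack_to_stdout && in->write_bitmap_index) ||
	    in->repository_is_shallow)
		use = 0;

	return use;
}
```

So a plain to-file run resolves to 0, the same run with
--use-bitmap-index resolves to 1, and asking to write a bitmap at the
same time forces it back to 0.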

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

---- 8< ----
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

real    0m22.309s
user    0m21.148s
sys     0m0.932s

$ time git index-pack erp5pack-stdout.pack

real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
user    0m49.300s
sys     0m1.360s
---- 8< ----

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107
    http://article.gmane.org/gmane.comp.version-control.git/299420
    http://article.gmane.org/gmane.comp.version-control.git/300217

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 25 +++++++++++++++++++++----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 8b1aee4..6a903c0 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2244,6 +2244,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 34b3019..2b2e74a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2264,7 +2265,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2527,7 +2528,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2812,7 +2813,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a76f6ca..58c3b29 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -200,6 +200,20 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84
---- 8< ----

Thanks,
Kirill


* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-28 20:22               ` Kirill Smelkov
@ 2016-07-28 21:18                 ` Junio C Hamano
  2016-07-29  7:40                   ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-07-28 21:18 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Jérome Perrin, Isabelle Vallet, Kazuhiko Shiozaki,
	Julien Muchembled, git, Vicent Marti

Kirill Smelkov <kirr@nexedi.com> writes:

> I'm waiting so long for main patch to be at least queued to pu, that I'm
> now a bit frustrated and ready to do something not related to main goal :)

Perhaps the first step would be to stop putting multiple patches in
a single e-mail buried after a few pages of discussion.  I will not
even find that there _are_ multiple patches in the message if I am
not involved directly in the discussion, and the discussion is still
ongoing, because it is likely that I'd skim just a few paragraphs at
the top before going on to other messages.

I won't touch the message I am responding to, as your -- 8< -- cut
mark does not even seem to be a reliable marker between patches
(i.e.  I see something like this that is clearly not a message
boundary:

than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

---- 8< ----
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

real    0m22.309s
...
)



* Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
  2016-07-28 21:18                 ` Junio C Hamano
@ 2016-07-29  7:40                   ` Kirill Smelkov
  2016-07-29  7:46                     ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Kirill Smelkov
  2016-07-29  7:47                     ` [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-29  7:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Jérome Perrin, Isabelle Vallet, Kazuhiko Shiozaki,
	Julien Muchembled, git, Vicent Marti

On Thu, Jul 28, 2016 at 02:18:29PM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > I'm waiting so long for main patch to be at least queued to pu, that I'm
> > now a bit frustrated and ready to do something not related to main goal :)
> 
> Perhaps the first step would be to stop putting multiple patches in
> a single e-mail buried after a few pages of discussion.  I will not
> even find that there _are_ multiple patches in the message if I am
> not involved directly in the discussion, and the discussion is still
> ongoing, because it is likely that I'd skim just a few paragraphs at
> the top before going on to other messages.
> 
> I won't touch the message I am responding to, as your -- 8< -- cut
> mark does not even seem to be a reliable marker between patches
> (i.e.  I see something like this that is clearly not a message
> boundary:
> 
> than `git pack-objects file.pack`. Extracting erp5.git pack from
> lab.nexedi.com backup repository:
> 
> ---- 8< ----
> $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
> 
> real    0m22.309s
> ...
> )

Ok, makes sense and my fault. I'm resending each patch as separate
message in reply to this mail.


* [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-07-29  7:40                   ` Kirill Smelkov
@ 2016-07-29  7:46                     ` Kirill Smelkov
  2016-08-01 18:17                       ` Junio C Hamano
  2016-07-29  7:47                     ` [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-29  7:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but the
bitmapped codepath has no such check at all.

Yet the bitmapped codepath accepted all those options and still used
bitmap indices under such conditions, potentially giving wrong output
(e.g. including objects from a non-local or .keep'ed pack).

We can easily fix this by noting the following: an object can reach
add_object_entry_from_bitmap() in one of two ways:

    1. as an entry from the main pack covered by the bitmap index, or
    2. as an object that is loose or lives in another pack, possibly in
       an alternate object database.

For "2" the pack has not yet been found by the bitmap traversal code, so
we can simply reuse the non-bitmapped want_object_in_pack() both to find
which pack the object lives in and to decide whether to omit it.

For "1" the pack has already been found by the bitmap traversal code, and
we only need to check that pack against the same criteria that
want_object_in_pack() applies to found_pack.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Discussed-with: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c  |  39 +++++++++++++++++++
 t/t5310-pack-bitmaps.sh | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..34b3019 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -987,6 +987,42 @@ static int want_object_in_pack(const unsigned char *sha1,
 	return 1;
 }
 
+/* Like want_object_in_pack() but for objects coming from-under bitmapped traversal */
+static int want_object_in_pack_bitmap(const unsigned char *sha1,
+				      struct packed_git **found_pack,
+				      off_t *found_offset)
+{
+	struct packed_git *p = *found_pack;
+
+	/*
+	 * There are two types of requests coming here:
+	 * 1. entries coming from main pack covered by bitmap index, and
+	 * 2. object coming from, possibly alternate, loose or other packs.
+	 *
+	 * For "1" we always have *found_pack != NULL passed here from
+	 * traverse_bitmap_commit_list(). (*found_pack is bitmap_git.pack
+	 * actually).
+	 *
+	 * For "2" we always have *found_pack == NULL passed here from
+	 * traverse_bitmap_commit_list() - since this is the way bitmap
+	 * traversal passes here "extended" bitmap entries.
+	 */
+
+	/* objects not covered by bitmap */
+	if (!p)
+		return want_object_in_pack(sha1, 0, found_pack, found_offset);
+
+	/* objects covered by bitmap - we only have to check p wrt local and .keep */
+	if (incremental)
+		return 0;
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	return 1;
+}
+
 static void create_object_entry(const unsigned char *sha1,
 				enum object_type type,
 				uint32_t hash,
@@ -1055,6 +1091,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack_bitmap(sha1, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..a76f6ca 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,88 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	mkdir -p alt_objects/pack &&
+	echo $(pwd)/alt_objects > .git/objects/info/alternates &&
+	echo content1 > file1 &&
+	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	git verify-pack -v 1.pack >1.objects &&
+	if egrep "^$objsha1" 1.objects; then
+		echo "Non-local object present in pack generated with --local: $objsha1"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 > file2 &&
+	objsha2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	pack2=$(echo $objsha2 | \
+		git pack-objects pack2) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	touch .git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $objsha2) &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	git verify-pack -v 2a.pack >2a.objects &&
+	if egrep "^$objsha2" 2a.objects; then
+		echo "Object from .keeped pack present in pack generated with --honor-pack-keep: $objsha2"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	git verify-pack -v 2b.pack >2b.objects &&
+	if egrep "^$objsha2" 2b.objects; then
+		echo "Non-local object present in pack generated with --local: $objsha2"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	git verify-pack -v .git/objects/pack/$packbitmap.pack >packbitmap.verify &&
+	grep -o "^$_x40" packbitmap.verify |sort >packbitmap.objects &&
+	touch .git/objects/pack/$packbitmap.keep &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	git verify-pack -v 3a.pack >3a.objects &&
+	if grep -qFf packbitmap.objects 3a.objects; then
+		echo "Object from .keeped bitmapped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi &&
+	rm .git/objects/pack/$packbitmap.keep
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	git verify-pack -v 3b.pack >3b.objects &&
+	if grep -qFf packbitmap.objects 3b.objects; then
+		echo "Non-local object from bitmapped pack present in pack generated with --local"
+		return 1
+	fi &&
+	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +225,24 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	git verify-pack -v 4.pack >4.verify &&
+	grep -o "^$_x40" 4.verify |sort >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	grep -o "^$_x40" revlist |sort >objects &&
+	if grep -qvFf objects 4.objects; then
+		echo "Expected objects not present in incremental pack"
+		return 1
+	fi
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.0.431.g3cb5c84


* [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-07-29  7:40                   ` Kirill Smelkov
  2016-07-29  7:46                     ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Kirill Smelkov
@ 2016-07-29  7:47                     ` Kirill Smelkov
  2016-08-08 13:56                       ` Jeff King
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-07-29  7:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

The reason is that for an on-disk repack we want, by default:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we take care that it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled only for the "send-to-stdout" case:

  Because on pack reuse, raw entries are written directly to the
  destination pack by write_reused_pack(), bypassing the bookkeeping
  needed for pack index generation that the regular codepath does in
  write_one() and friends.

This way all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be a 30% - 50% speedup for pack extraction.

git-backup extracts many packs when restoring repositories. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. So we are not adding another config point;
instead, to-file pack-objects always defaults to not using the bitmap
index. Tools which need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so we can be sure regular on-disk repacking
remains robust.

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107
    http://article.gmane.org/gmane.comp.version-control.git/299420
    http://article.gmane.org/gmane.comp.version-control.git/300217

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 25 +++++++++++++++++++++----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 8b1aee4..6a903c0 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2244,6 +2244,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 34b3019..2b2e74a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2264,7 +2265,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2527,7 +2528,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2812,7 +2813,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a76f6ca..58c3b29 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -200,6 +200,20 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-07-29  7:46                     ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Kirill Smelkov
@ 2016-08-01 18:17                       ` Junio C Hamano
  2016-08-08 12:37                         ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-01 18:17 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> are two codepaths in pack-objects: with & without using bitmap
> reachability index.
>
> However add_object_entry_from_bitmap(), despite its non-bitmapped
> counterpart add_object_entry(), in no way does check for whether --local
> or --honor-pack-keep or --incremental should be respected. In
> non-bitmapped codepath this is handled in want_object_in_pack(), but
> bitmapped codepath has simply no such checking at all.
>
> The bitmapped codepath however was allowing to pass in all those options
> and with bitmap indices still being used under such conditions -
> potentially giving wrong output (e.g. including objects from non-local or
> .keep'ed pack).
>
> We can easily fix this by noting the following: when an object comes to
> add_object_entry_from_bitmap() it can come for two reasons:
>
>     1. entries coming from main pack covered by bitmap index, and
>     2. object coming from, possibly alternate, loose or other packs.
>
> For "2" we always have pack not yet found by bitmap traversal code, and
> thus we can simply reuse non-bitmapped want_object_in_pack() to find in
> which pack an object lives and also for taking omitting decision.
>
> For "1" we always have pack already found by bitmap traversal code and we
> only need to check that pack for same criteria used in
> want_object_in_pack() for found_pack.
>
> Suggested-by: Junio C Hamano <gitster@pobox.com>
> Discussed-with: Jeff King <peff@peff.net>
> ---

I do not think I suggested much of this to deserve credit like this,
though, as I certainly haven't thought about the pros-and-cons
between adding the same "some object in pack may not want to be in
the output" logic to the bitmap side, or punting the bitmap codepath
when local/keep are involved.

> +/* Like want_object_in_pack() but for objects coming from-under bitmapped traversal */
> +static int want_object_in_pack_bitmap(const unsigned char *sha1,
> +				      struct packed_git **found_pack,
> +				      off_t *found_offset)
> +{
> +	struct packed_git *p = *found_pack;
> +
> +	/*
> +	 * There are two types of requests coming here:
> +	 * 1. entries coming from main pack covered by bitmap index, and
> +	 * 2. object coming from, possibly alternate, loose or other packs.
> +	 *
> +	 * For "1" we always have *found_pack != NULL passed here from
> +	 * traverse_bitmap_commit_list(). (*found_pack is bitmap_git.pack
> +	 * actually).
> +	 *
> +	 * For "2" we always have *found_pack == NULL passed here from
> +	 * traverse_bitmap_commit_list() - since this is the way bitmap
> +	 * traversal passes here "extended" bitmap entries.
> +	 */
> +
> +	/* objects not covered by bitmap */
> +	if (!p)
> +		return want_object_in_pack(sha1, 0, found_pack, found_offset);
> +	/* objects covered by bitmap - we only have to check p wrt local and .keep */

I am assuming that p != NULL only means "this object exists in THIS
pack", without saying anything about "this object may also exist in
other places", but "we only have to check" implies that "p != NULL"
means "this object exists *ONLY* in this pack and nowhere else".

Puzzled.


> +	if (incremental)
> +		return 0;
> +	if (local && !p->pack_local)
> +		return 0;
> +	if (ignore_packed_keep && p->pack_local && p->pack_keep)
> +		return 0;
> +
> +	return 1;
> +}
> +


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-01 18:17                       ` Junio C Hamano
@ 2016-08-08 12:37                         ` Kirill Smelkov
  2016-08-08 13:50                           ` Jeff King
  2016-08-08 16:11                           ` Junio C Hamano
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 12:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 01, 2016 at 11:17:30AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> > are two codepaths in pack-objects: with & without using bitmap
> > reachability index.
> >
> > However add_object_entry_from_bitmap(), despite its non-bitmapped
> > counterpart add_object_entry(), in no way does check for whether --local
> > or --honor-pack-keep or --incremental should be respected. In
> > non-bitmapped codepath this is handled in want_object_in_pack(), but
> > bitmapped codepath has simply no such checking at all.
> >
> > The bitmapped codepath however was allowing to pass in all those options
> > and with bitmap indices still being used under such conditions -
> > potentially giving wrong output (e.g. including objects from non-local or
> > .keep'ed pack).
> >
> > We can easily fix this by noting the following: when an object comes to
> > add_object_entry_from_bitmap() it can come for two reasons:
> >
> >     1. entries coming from main pack covered by bitmap index, and
> >     2. object coming from, possibly alternate, loose or other packs.
> >
> > For "2" we always have pack not yet found by bitmap traversal code, and
> > thus we can simply reuse non-bitmapped want_object_in_pack() to find in
> > which pack an object lives and also for taking omitting decision.
> >
> > For "1" we always have pack already found by bitmap traversal code and we
> > only need to check that pack for same criteria used in
> > want_object_in_pack() for found_pack.
> >
> > Suggested-by: Junio C Hamano <gitster@pobox.com>
> > Discussed-with: Jeff King <peff@peff.net>
> > ---
> 
> I do not think I suggested much of this to deserve credit like this,
> though, as I certainly haven't thought about the pros-and-cons
> between adding the same "some object in pack may not want to be in
> the output" logic to the bitmap side, or punting the bitmap codepath
> when local/keep are involved.

I understand. Still for me it was you who convinced me to add proper
support for e.g. --local vs bitmap instead of special-casing it.
I think we also can avoid punting the bitmap codepath - please see
below.


> > +/* Like want_object_in_pack() but for objects coming from-under bitmapped traversal */
> > +static int want_object_in_pack_bitmap(const unsigned char *sha1,
> > +				      struct packed_git **found_pack,
> > +				      off_t *found_offset)
> > +{
> > +	struct packed_git *p = *found_pack;
> > +
> > +	/*
> > +	 * There are two types of requests coming here:
> > +	 * 1. entries coming from main pack covered by bitmap index, and
> > +	 * 2. object coming from, possibly alternate, loose or other packs.
> > +	 *
> > +	 * For "1" we always have *found_pack != NULL passed here from
> > +	 * traverse_bitmap_commit_list(). (*found_pack is bitmap_git.pack
> > +	 * actually).
> > +	 *
> > +	 * For "2" we always have *found_pack == NULL passed here from
> > +	 * traverse_bitmap_commit_list() - since this is the way bitmap
> > +	 * traversal passes here "extended" bitmap entries.
> > +	 */
> > +
> > +	/* objects not covered by bitmap */
> > +	if (!p)
> > +		return want_object_in_pack(sha1, 0, found_pack, found_offset);
> > +	/* objects covered by bitmap - we only have to check p wrt local and .keep */
> 
> I am assuming that p != NULL only means "this object exists in THIS
> pack", without saying anything about "this object may also exist in
> other places", but "we only have to check" implies that "p != NULL"
> means "this object exists *ONLY* in this pack and nowhere else".
> 
> Puzzled.

You are right. Being new to --local and .keep I've missed this. I've
added tests to cover cases like "object lives in both bitmapped pack and
non-local loose or .keep'ed pack" and made the adjustments. The checks
now live unified in want_object_in_pack() for both bitmapped and
non-bitmapped codepaths. Please apply the following corrected patch on
top of 56dfeb62 (jk/pack-objects-optim).

Thanks,
Kirill

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 29 Jul 2016 10:46:56 +0300
Subject: [PATCH v2] pack-objects: Teach --use-bitmap-index codepath to respect
 --local, --honor-pack-keep and --incremental

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath has no such checking at all.

The bitmapped codepath nevertheless accepted all those options and
kept using bitmap indices under such conditions, potentially giving
wrong output (e.g. including objects from a non-local or .keep'ed
pack).

We can easily fix this by noting the following: when an object comes to
add_object_entry_from_bitmap() it can come for two reasons:

    1. entries coming from main pack covered by bitmap index, and
    2. object coming from, possibly alternate, loose or other packs.

"2" can be already handled by want_object_in_pack() and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning calling client already found object's pack entry.

In want_object_in_pack() we take care to start the checks from the
already found pack, if we have one, so that we do no more than 1 iteration
when neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows, we do not harm
served-with-bitmap clones performance-wise:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.14(8.18+0.31)   8.89(7.92+0.28) -2.7%
    5310.3: simulated clone   1.94(2.14+0.07)   1.91(2.08+0.08) -1.5%
    5310.4: simulated fetch   0.75(1.01+0.02)   0.75(0.94+0.07) +0.0%
    5310.6: partial bitmap    1.99(2.44+0.16)   1.95(2.40+0.14) -2.0%

with all differences strangely showing we are a bit faster now, but
probably all being within noise.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Discussed-with: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c  |  36 ++++++++++++-----
 t/t5310-pack-bitmaps.sh | 103 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+), 9 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..2c274d3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -948,9 +948,9 @@ static int have_duplicate_entry(const unsigned char *sha1,
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * As a side effect of this check, if object's pack entry was not already found,
+ * we will find the packed version of this object, if any. We therefore pass
+ * out the pack information to avoid having to look it up again later.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +958,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	struct packed_git *pack1 = *found_pack;
+	int pack1_seen = !pack1;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present the loop will degenerate to have only 1 iteration.
+	 */
+	for (p = (pack1 ? pack1 : packed_git); p;
+	     p = (pack1_seen ? p->next : packed_git), pack1_seen = 1) {
+		off_t offset;
+
+		if (p == pack1) {
+			if (pack1_seen)
+				continue;
+			offset = *found_offset;
+		}
+		else {
+			offset = find_pack_entry_one(sha1, p);
+		}
 
-	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -1039,8 +1054,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1088,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..1a61de4 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -16,6 +16,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git show-ref -s master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +119,90 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	mkdir -p alt_objects/pack &&
+	echo $(pwd)/alt_objects > .git/objects/info/alternates &&
+	echo content1 > file1 &&
+	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	git verify-pack -v 1.pack >1.objects &&
+	echo -e "$objsha1\n$blob" >nonlocal-loose &&
+	if grep -qFf nonlocal-loose 1.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 > file2 &&
+	objsha2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	echo -e "$objsha2\n$bitmaptip" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	touch .git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $objsha2) &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	git verify-pack -v 2a.pack >2a.objects &&
+	if grep -qFf keepobjects 2a.objects; then
+		echo "Object from .keeped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	git verify-pack -v 2b.pack >2b.objects &&
+	if grep -qFf keepobjects 2b.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	git verify-pack -v .git/objects/pack/$packbitmap.pack >packbitmap.verify &&
+	grep -o "^$_x40" packbitmap.verify |sort >packbitmap.objects &&
+	touch .git/objects/pack/$packbitmap.keep &&
+	echo HEAD | \
+	git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	git verify-pack -v 3a.pack >3a.objects &&
+	if grep -qFf packbitmap.objects 3a.objects; then
+		echo "Object from .keeped bitmapped pack present in pack generated with --honour-pack-keep"
+		return 1
+	fi &&
+	rm .git/objects/pack/$packbitmap.keep
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	echo HEAD | \
+	git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	git verify-pack -v 3b.pack >3b.objects &&
+	if grep -qFf packbitmap.objects 3b.objects; then
+		echo "Non-local object from bitmapped pack present in pack generated with --local"
+		return 1
+	fi &&
+	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +228,24 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	git verify-pack -v 4.pack >4.verify &&
+	grep -o "^$_x40" 4.verify |sort >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	grep -o "^$_x40" revlist |sort >objects &&
+	if grep -qvFf objects 4.objects; then
+		echo "Expected objects not present in incremental pack"
+		return 1
+	fi
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 12:37                         ` Kirill Smelkov
@ 2016-08-08 13:50                           ` Jeff King
  2016-08-08 13:51                             ` Jeff King
                                               ` (2 more replies)
  2016-08-08 16:11                           ` Junio C Hamano
  1 sibling, 3 replies; 62+ messages in thread
From: Jeff King @ 2016-08-08 13:50 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 03:37:35PM +0300, Kirill Smelkov wrote:

> @@ -958,15 +958,30 @@ static int want_object_in_pack(const unsigned char *sha1,
>  			       off_t *found_offset)
>  {
>  	struct packed_git *p;
> +	struct packed_git *pack1 = *found_pack;
> +	int pack1_seen = !pack1;
>  
>  	if (!exclude && local && has_loose_object_nonlocal(sha1))
>  		return 0;
>  
> -	*found_pack = NULL;
> -	*found_offset = 0;
> +	/*
> +	 * If we already know the pack object lives in, start checks from that
> +	 * pack - in the usual case when neither --local was given nor .keep files
> +	 * are present the loop will degenerate to have only 1 iteration.
> +	 */
> +	for (p = (pack1 ? pack1 : packed_git); p;
> +	     p = (pack1_seen ? p->next : packed_git), pack1_seen = 1) {
> +		off_t offset;

Hmm. So this is basically sticking the found-pack at the front of the
loop.

We either need to look at zero packs here (we already know where the
object is, and we don't need to bother with --local or .keep lookups),
or we need to look at all of them (to check for local/keep).

I guess you structured it this way to try to reuse the "can we break out
early" logic from the middle of the loop. So we go through the loop one
time, and then break out. And then this:

> +		if (p == pack1) {
> +			if (pack1_seen)
> +				continue;
> +			offset = *found_offset;
> +		}
> +		else {
> +			offset = find_pack_entry_one(sha1, p);
> +		}

is meant to make that one-time through the loop cheaper. So I don't
think it's wrong, but it's very confusing to me.

Would it be simpler to stick that logic in a function like:

  static int want_found_object(int exclude, struct packed_git *p)
  {
	if (exclude)
		return 1;
	if (incremental)
		return 0;

	/* if we can break early, then do so */
	if (!ignore_packed_keep &&
	    (!local || !have_non_local_packs))
		return 1;

	if (local && !p->pack_local)
		return 0;
	if (ignore_packed_keep && p->pack_local && p->pack_keep)
		return 0;

	/* indeterminate; keep looking for more packs */
	return -1;
  }

  static int want_object_in_pack(...)
  {
	...
	if (!exclude && local && has_loose_object_nonlocal(sha1))
		return 0;

	if (*found_pack) {
		int ret = want_found_object(exclude, *found_pack);
		if (ret != -1)
			return ret;
	}

	for (p = packed_git; p; p = p->next) {
		off_t offset;

		if (p == *found_pack)
			offset = *found_offset;
		else
			offset = find_pack_entry_one(sha1, p);
		if (offset) {
			... fill in *found_pack ...
			int ret = want_found_object(exclude, p);
			if (ret != -1)
				return ret;
		}
	}
	return 1;
  }

That's a little more verbose, but IMHO the flow is a lot easier to
follow (especially as the later re-rolls of that series actually muck
with the loop order more, but with this approach there's no conflict).

>  static int add_object_entry(const unsigned char *sha1, enum object_type type,
>  			    const char *name, int exclude)
>  {
> -	struct packed_git *found_pack;
> -	off_t found_offset;
> +	struct packed_git *found_pack = NULL;
> +	off_t found_offset = 0;
>  	uint32_t index_pos;
>  
>  	if (have_duplicate_entry(sha1, exclude, &index_pos))
> @@ -1073,6 +1088,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>  		return 0;
>  
> +	if (!want_object_in_pack(sha1, 0, &pack, &offset))
> +		return 0;
> +

This part looks correct and easy to understand.

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index 3893afd..1a61de4 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -16,6 +16,7 @@ test_expect_success 'setup repo with moderate-sized history' '
>  		test_commit side-$i
>  	done &&
>  	git checkout master &&
> +	bitmaptip=$(git show-ref -s master) &&

Our usual method for getting a sha1 is "git rev-parse". I don't think
there's anything wrong with your method, but it might be better to stick
to the canonical one (I had to actually look up "show-ref -s").

> @@ -118,6 +119,90 @@ test_expect_success 'incremental repack can disable bitmaps' '
>  	git repack -d --no-write-bitmap-index
>  '
>  
> +test_expect_success 'pack-objects respects --local (non-local loose)' '
> +	mkdir -p alt_objects/pack &&
> +	echo $(pwd)/alt_objects > .git/objects/info/alternates &&
> +	echo content1 > file1 &&

Style: we don't put a space between ">" and the filename.

> +	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
> +	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&

I'm not sure why we need two objects in the fake alt_objects repository.
Shouldn't one be enough to do the test?

> +	git add file1 &&

I think this will actually skip the writing of the loose object, because
it's already available in the alternate object store. You probably want
to do this before adding it there.

> +	test_tick &&
> +	git commit -m commit_file1 &&
> +	echo HEAD | \

No need for "\" after a "|"; the shell knows it has to keep looking.
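
A minimal sketch of that point, with `tr` standing in for the real
command (it is not the test from the patch): a line that ends in "|" is
an incomplete command, so the shell reads the next line as the rest of
the pipeline.

```shell
# A trailing "|" leaves the command incomplete, so the shell keeps
# reading; no backslash is needed to continue on the next line.
result=$(
	echo HEAD |
	tr '[:upper:]' '[:lower:]'
)
echo "$result"
```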

> +	git pack-objects --local --stdout --revs >1.pack &&
> +	git index-pack 1.pack &&

I'd have expected you to use the non-stdout version here. Is this meant
to be independent of your other patch? (I think that's OK.)

> +	git verify-pack -v 1.pack >1.objects &&

It's cheaper to use "git show-index <1.idx", and the output is saner,
too.

> +	echo -e "$objsha1\n$blob" >nonlocal-loose &&

"echo -e" isn't portable. You can use "printf", or two echos like:

  {
    echo one &&
    echo two
  } >file

(though I'm still not sure what we gain by checking both).
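
The printf variant could look roughly like the sketch below. The two
object names and the temporary directory are made up for illustration;
in the real test they would be $objsha1 and $blob.

```shell
# Hypothetical stand-ins for the two object names used in the test.
first=1111111111111111111111111111111111111111
second=2222222222222222222222222222222222222222

# Scratch directory so the sketch does not touch the current one.
tmp=$(mktemp -d)

# printf reuses its format for each extra argument, so this writes
# one name per line without relying on "echo -e".
printf '%s\n' "$first" "$second" >"$tmp/nonlocal-loose"
cat "$tmp/nonlocal-loose"
```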

> +	if grep -qFf nonlocal-loose 1.objects; then
> +		echo "Non-local object present in pack generated with --local"
> +		return 1
> +	fi
> +'

grep -f isn't portable. However, I think:

  echo $objsha1 >expect &&
  git show-index <1.idx | cut -d' ' -f2 >actual &&
  test_cmp expect actual

would work (if you do stick with two entries, you might need to sort
your "expect").
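
If the test really wants "none of these objects is in the pack" rather
than an exact match, one possible way to do it without "grep -f" is to
sort both lists and intersect them with comm. This is only a sketch:
has_any is a hypothetical helper name, and the file contents are
placeholders rather than real object names.

```shell
# Hypothetical helper (not part of the patch): succeed if any line
# of file $1 also appears in file $2. Uses only sort and comm.
has_any () {
	sort -u "$1" >"$1.sorted" &&
	sort -u "$2" >"$2.sorted" &&
	test -n "$(comm -12 "$1.sorted" "$2.sorted")"
}

# Placeholder data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' aaa bbb >"$tmp/unwanted"
printf '%s\n' ccc ddd >"$tmp/clean"
printf '%s\n' ccc bbb >"$tmp/tainted"

if has_any "$tmp/unwanted" "$tmp/clean"
then
	echo "clean: unexpected overlap"
else
	echo "clean: no overlap"
fi

if has_any "$tmp/unwanted" "$tmp/tainted"
then
	echo "tainted: overlap found"
else
	echo "tainted: no overlap"
fi
```

In a test this would be used negated, e.g. "! has_any nonlocal-loose
1.objects", so the test fails as soon as one unwanted object shows up.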

I think similar comments apply to the other tests. I would have expected
"respects --local (non-local pack)" to come next (i.e., to keep all of
the --local tests together). But you seem to interleave them with
--honor-pack-keep.

-Peff


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 13:50                           ` Jeff King
@ 2016-08-08 13:51                             ` Jeff King
  2016-08-08 16:08                             ` Junio C Hamano
  2016-08-08 19:06                             ` Junio C Hamano
  2 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2016-08-08 13:51 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 09:50:20AM -0400, Jeff King wrote:

> > +	git pack-objects --local --stdout --revs >1.pack &&
> > +	git index-pack 1.pack &&
> 
> I'd have expected you to use the non-stdout version here. Is this meant
> to be independent of your other patch? (I think that's OK.)

Oh, nevermind, I forgot this was meant to be a preparatory patch. So it
makes sense to use --stdout in the tests.

-Peff


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-07-29  7:47                     ` [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
@ 2016-08-08 13:56                       ` Jeff King
  2016-08-08 15:40                         ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-08-08 13:56 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Fri, Jul 29, 2016 at 10:47:46AM +0300, Kirill Smelkov wrote:

> @@ -2527,7 +2528,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>  	if (prepare_bitmap_walk(revs) < 0)
>  		return -1;
>  
> -	if (pack_options_allow_reuse() &&
> +	if (pack_options_allow_reuse() && pack_to_stdout &&
>  	    !reuse_partial_packfile_from_bitmap(

Should pack_to_stdout just be part of pack_options_allow_reuse()?

> @@ -2812,7 +2813,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
>  		unpack_unreachable_expiration = 0;
>  
> -	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
> +	/*
> +	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
> +	 *
> +	 * - to produce good pack (with bitmap index not-yet-packed objects are
> +	 *   packed in suboptimal order).
> +	 *
> +	 * - to use more robust pack-generation codepath (avoiding possible
> +	 *   bugs in bitmap code and possible bitmap index corruption).
> +	 */
> +	if (!pack_to_stdout)
> +		use_bitmap_index_default = 0;
> +
> +	if (use_bitmap_index < 0)
> +		use_bitmap_index = use_bitmap_index_default;
> +
> +	/* "hard" reasons not to use bitmaps; these just won't work at all */
> +	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
>  		use_bitmap_index = 0;

This all makes sense and looks good.

> +test_expect_success 'pack-objects to file can use bitmap' '
> +	# make sure we still have 1 bitmap index from previous tests
> +	ls .git/objects/pack/ | grep bitmap >output &&
> +	test_line_count = 1 output &&
> +	# verify equivalent packs are generated with/without using bitmap index
> +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> +	git verify-pack -v packa-$packasha1.pack >packa.verify &&
> +	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
> +	grep -o "^$_x40" packa.verify |sort >packa.objects &&
> +	grep -o "^$_x40" packb.verify |sort >packb.objects &&
> +	test_cmp packa.objects packb.objects
> +'

I don't think "grep -o" is portable. However, an easier way to do this
is probably:

  # these are already in sorted order
  git show-index <packa-$packasha1.idx | cut -d' ' -f2 >packa.objects &&
  git show-index <packb-$packbsha1.idx | cut -d' ' -f2 >packb.objects &&
  test_cmp packa.objects packb.objects
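
If the "verify-pack -v" output is wanted for some other reason, the
"grep -o" step could also be replaced with sed, which POSIX specifies. This
is only a sketch of that alternative, not something proposed in the thread;
leading_sha1 is a hypothetical helper name, and the pattern assumes a
40-hex-digit object name at the start of each object line:

```shell
# Read lines on stdin and print only the leading 40-hex object name of
# those lines that start with one; other lines (summary lines such as
# "non delta: ...") are dropped.
# leading_sha1 is a hypothetical helper used only for this illustration.
leading_sha1 () {
	sed -n 's/^\([0-9a-f]\{40\}\).*$/\1/p'
}
```

It would be used as `leading_sha1 <packa.verify | sort >packa.objects`.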

-Peff


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 13:56                       ` Jeff King
@ 2016-08-08 15:40                         ` Kirill Smelkov
  2016-08-08 18:08                           ` Junio C Hamano
  2016-08-08 18:55                           ` [PATCH v5] pack-objects: teach " Kirill Smelkov
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 15:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 09:56:00AM -0400, Jeff King wrote:
> On Fri, Jul 29, 2016 at 10:47:46AM +0300, Kirill Smelkov wrote:
> 
> > @@ -2527,7 +2528,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
> >  	if (prepare_bitmap_walk(revs) < 0)
> >  		return -1;
> >  
> > -	if (pack_options_allow_reuse() &&
> > +	if (pack_options_allow_reuse() && pack_to_stdout &&
> >  	    !reuse_partial_packfile_from_bitmap(
> 
> Should pack_to_stdout just be part of pack_options_allow_reuse()?

Yes, makes sense; thanks for catching this.


> > @@ -2812,7 +2813,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> >  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
> >  		unpack_unreachable_expiration = 0;
> >  
> > -	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
> > +	/*
> > +	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
> > +	 *
> > +	 * - to produce good pack (with bitmap index not-yet-packed objects are
> > +	 *   packed in suboptimal order).
> > +	 *
> > +	 * - to use more robust pack-generation codepath (avoiding possible
> > +	 *   bugs in bitmap code and possible bitmap index corruption).
> > +	 */
> > +	if (!pack_to_stdout)
> > +		use_bitmap_index_default = 0;
> > +
> > +	if (use_bitmap_index < 0)
> > +		use_bitmap_index = use_bitmap_index_default;
> > +
> > +	/* "hard" reasons not to use bitmaps; these just won't work at all */
> > +	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
> >  		use_bitmap_index = 0;
> 
> This all makes sense and looks good.

Thanks.


> > +test_expect_success 'pack-objects to file can use bitmap' '
> > +	# make sure we still have 1 bitmap index from previous tests
> > +	ls .git/objects/pack/ | grep bitmap >output &&
> > +	test_line_count = 1 output &&
> > +	# verify equivalent packs are generated with/without using bitmap index
> > +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> > +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> > +	git verify-pack -v packa-$packasha1.pack >packa.verify &&
> > +	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
> > +	grep -o "^$_x40" packa.verify |sort >packa.objects &&
> > +	grep -o "^$_x40" packb.verify |sort >packb.objects &&
> > +	test_cmp packa.objects packb.objects
> > +'
> 
> I don't think "grep -o" is portable. However, an easier way to do this
> is probably:
> 
>   # these are already in sorted order
>   git show-index <packa-$packasha1.idx | cut -d' ' -f2 >packa.objects &&
>   git show-index <packb-$packbsha1.idx | cut -d' ' -f2 >packb.objects &&
>   test_cmp packa.objects packb.objects

Thanks for the info. I did not know about show-index when I started
working on this, and later it just dropped out of sight. Please find
the corrected patch below.

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 29 Jul 2016 10:47:46 +0300
Subject: [PATCH v5] pack-objects: Teach it to use reachability bitmap index when
 generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That, however, was done only for
the case when the resultant pack is sent to stdout, not written into a file.

The reason is that for on-disk repack, by default, we want:

- to produce a good pack (with a bitmap index, not-yet-packed objects are
  emitted to the pack in suboptimal order).

- to use the more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use the bitmap index for the initial
object counting phase when generating a resultant pack file too:

- if we take care that it is not activated under git-repack:

  See above about repack robustness and not carrying corruption forward.

- if we know bitmap index generation is not enabled for the resultant pack:

  The current code has a singleton bitmap_git, so it cannot work
  simultaneously with two bitmap indices.

  We also want to avoid (at least with the current implementation)
  generating bitmaps off of bitmaps. The reason is: when generating
  a pack, not-yet-packed objects are emitted into the pack in
  suboptimal order and added to the tail of the bitmap as "extended
  entries". When the resultant pack plus some new objects in the
  associated repository are in turn used to generate another pack with
  a bitmap, the situation repeats: new objects are again not emitted
  optimally and are just added to the bitmap tail, not in recency order.

  So the pack badness can grow over time when at each step we have
  a bitmapped pack plus some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps: so as not to let pack badness grow.

- if we keep pack reuse enabled only for the "send-to-stdout" case:

  On pack reuse, raw entries are written directly to the destination
  pack by write_reused_pack(), bypassing the bookkeeping needed for pack
  index generation that the regular codepath does in write_one() and
  friends.

This way, for pack-objects -> file, we get a nice speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be a 30% - 50% speedup for pack extraction.

git-backup extracts many packs when restoring repositories. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. So we are not adding another config point;
instead, to-file pack-objects always defaults to not using the bitmap
index. Tools which need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so we can be sure regular on-disk repacking
remains robust.

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://marc.info/?t=146792101400001&r=1&w=2

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 31 ++++++++++++++++++++++++-------
 t/t5310-pack-bitmaps.sh  | 12 ++++++++++++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index bc1c433..4ba0c4a 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2244,6 +2244,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 92e2e5f..0a89e8d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2226,7 +2227,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2475,13 +2476,13 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 }
 
 /*
- * This tracks any options which a reader of the pack might
- * not understand, and which would therefore prevent blind reuse
- * of what we have on disk.
+ * This tracks any options which pack-reuse code expects to be on, or which a
+ * reader of the pack might not understand, and which would therefore prevent
+ * blind reuse of what we have on disk.
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_ofs_delta;
+	return pack_to_stdout && allow_ofs_delta;
 }
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
@@ -2774,7 +2775,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..ffecc6a 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,18 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git show-index <packa-$packasha1.idx | cut -d" " -f2 >packa.objects &&
+	git show-index <packb-$packbsha1.idx | cut -d" " -f2 >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 13:50                           ` Jeff King
  2016-08-08 13:51                             ` Jeff King
@ 2016-08-08 16:08                             ` Junio C Hamano
  2016-08-08 19:06                             ` Junio C Hamano
  2 siblings, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 16:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Kirill Smelkov, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Jeff King <peff@peff.net> writes:

> On Mon, Aug 08, 2016 at 03:37:35PM +0300, Kirill Smelkov wrote:
> ...
>   static int want_object_in_pack(...)
>   {
> 	...
> 	if (!exclude && local && has_loose_object_nonlocal(sha1))
> 		return 0;
>
> 	if (*found_pack) {
> 		int ret = want_found_object(exclude, *found_pack);
> 		if (ret != -1)
> 			return ret;
> 	}
>
> 	for (p = packed_git; p; p = p->next) {
> 		off_t offset;
>
> 		if (p == *found_pack)
> 			offset = *found_offset;
> 		else
> 			offset = find_pack_entry(sha1, p);
> 		if (offset) {
> 			... fill in *found_pack ...
> 			int ret = want_found_object(exclude, p);
> 			if (ret != -1)
> 				return ret;
> 		}
> 	}
> 	return 1;
>   }
>
> That's a little more verbose, but IMHO the flow is a lot easier to
> follow (especially as the later re-rolls of that series actually muck
> with the loop order more, but with this approach there's no conflict).

I agree; Kirill's version was so confusing that I couldn't see what
it was trying to do with "pack1_seen" flag that is reset every time
loop repeats (at least, before I got my coffee ;-).  A helper function
like the above makes the logic a lot easier to grasp.

>>  static int add_object_entry(const unsigned char *sha1, enum object_type type,
>>  			    const char *name, int exclude)
>>  {
>> -	struct packed_git *found_pack;
>> -	off_t found_offset;
>> +	struct packed_git *found_pack = NULL;
>> +	off_t found_offset = 0;
>>  	uint32_t index_pos;
>>  
>>  	if (have_duplicate_entry(sha1, exclude, &index_pos))
>> @@ -1073,6 +1088,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>>  		return 0;
>>  
>> +	if (!want_object_in_pack(sha1, 0, &pack, &offset))
>> +		return 0;
>> +
>
> This part looks correct and easy to understand.

Yes.


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 12:37                         ` Kirill Smelkov
  2016-08-08 13:50                           ` Jeff King
@ 2016-08-08 16:11                           ` Junio C Hamano
  2016-08-08 18:19                             ` Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 16:11 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

>> > ...
>> > Suggested-by: Junio C Hamano <gitster@pobox.com>
>> > Discussed-with: Jeff King <peff@peff.net>
>> > ---
>> 
>> I do not think I suggested much of this to deserve credit like this,
>> though, as I certainly haven't thought about the pros-and-cons
>> between adding the same "some object in pack may not want to be in
>> the output" logic to the bitmap side, or punting the bitmap codepath
>> when local/keep are involved.
>
> I understand. Still for me it was you who convinced me to add proper
> support for e.g. --local vs bitmap instead of special-casing it.

OK, in such a case, it probably is more sensible to do it like:

    ...
    with all differences strangely showing we are a bit faster now, but
    probably all being within noise.

    Credit for inspiring this solution and discussing the design of
    the change goes to Junio and Jeff King.

    Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
    ---
     builtin/pack-objects.c  |  36 ++++++++++++-----
     t/t5310-pack-bitmaps.sh | 103 ++++++++++++++++++++++++++++++++++++++++++++++++
     2 files changed, 130 insertions(+), 9 deletions(-)

Don't forget your own sign-off ;-)


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 15:40                         ` Kirill Smelkov
@ 2016-08-08 18:08                           ` Junio C Hamano
  2016-08-08 18:13                             ` Kirill Smelkov
  2016-08-08 18:55                           ` [PATCH v5] pack-objects: teach " Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 18:08 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> Thanks for the info. I did not know about show-index when I was starting
> to work on this and later it just came out of sight. Please find
> corrected patch below.
>
> ---- 8< ----
> From: Kirill Smelkov <kirr@nexedi.com>
> Date: Fri, 29 Jul 2016 10:47:46 +0300
> Subject: [PATCH v5] pack-objects: Teach it to use reachability bitmap index when
>  generating non-stdout pack too

Please don't do this (not the patch text itself, but saying "Please
find ..." and attaching the patch AFTER 60+ lines of response).
When going through old/read messages to see if there are patches
that fell through the cracks, if it is not immediately clear in the
top part of the message that it contains an updated patch, such a
patch will certainly be missed.

Please say "I'll follow up with a corrected patch" instead of
"Please find ..." and respond to that message with just the patch.


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 18:08                           ` Junio C Hamano
@ 2016-08-08 18:13                             ` Kirill Smelkov
  2016-08-08 18:28                               ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 18:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 11:08:34AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > Thanks for the info. I did not know about show-index when I was starting
> > to work on this and later it just came out of sight. Please find
> > corrected patch below.
> >
> > ---- 8< ----
> > From: Kirill Smelkov <kirr@nexedi.com>
> > Date: Fri, 29 Jul 2016 10:47:46 +0300
> > Subject: [PATCH v5] pack-objects: Teach it to use reachability bitmap index when
> >  generating non-stdout pack too
> 
> Please don't do this (not the patch text itself, but saying "Please
> find ..." and attaching the patch AFTER 60+ lines of response).
> When going through old/read messages to see if there are patches
> that fell through the cracks, if it is not immediately clear in the
> top part of the message that it contains an updated patch, such a
> patch will certainly be missed.
> 
> Please say "I'll follow up with a corrected patch" instead of
> "Please find ..." and respond to that message with just the patch.

Ok, I see. Should I resend this v5 as a separate mail, or do that only
starting from next time?

Another question: I'm preparing another version of "pack-objects: Teach
--use-bitmap-index codepath to respect --local ..." and was going to
put

    ( updated patch is at the end of this mail )

at the top of the message. Is that OK, or is it better not to do so and
just respin the patch in its own separate mail?

Thanks in advance for clarifying,
Kirill

P.S. I put updated patches in the same mail not because I'm trying to
    make maintainer's life harder, but because this is the way I would
    expect and prefer them to be coming to me...


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 16:11                           ` Junio C Hamano
@ 2016-08-08 18:19                             ` Kirill Smelkov
  2016-08-08 18:57                               ` [PATCH v3] " Kirill Smelkov
  2016-08-08 19:26                               ` [PATCH 1/2] " Junio C Hamano
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 18:19 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git


( updated patch is at the end of this mail )

Jeff, first of all thanks for commenting,

On Mon, Aug 08, 2016 at 09:50:20AM -0400, Jeff King wrote:
> On Mon, Aug 08, 2016 at 03:37:35PM +0300, Kirill Smelkov wrote:
> 
> > @@ -958,15 +958,30 @@ static int want_object_in_pack(const unsigned char *sha1,
> >  			       off_t *found_offset)
> >  {
> >  	struct packed_git *p;
> > +	struct packed_git *pack1 = *found_pack;
> > +	int pack1_seen = !pack1;
> >  
> >  	if (!exclude && local && has_loose_object_nonlocal(sha1))
> >  		return 0;
> >  
> > -	*found_pack = NULL;
> > -	*found_offset = 0;
> > +	/*
> > +	 * If we already know the pack object lives in, start checks from that
> > +	 * pack - in the usual case when neither --local was given nor .keep files
> > +	 * are present the loop will degenerate to have only 1 iteration.
> > +	 */
> > +	for (p = (pack1 ? pack1 : packed_git); p;
> > +	     p = (pack1_seen ? p->next : packed_git), pack1_seen = 1) {
> > +		off_t offset;
> 
> Hmm. So this is basically sticking the found-pack at the front of the
> loop.
> 
> We either need to look at zero packs here (we already know where the
> object is, and we don't need to bother with --local or .keep lookups),
> or we need to look at all of them (to check for local/keep).
> 
> I guess you structured it this way to try to reuse the "can we break out
> early" logic from the middle of the loop. So we go through the loop one
> time, and then break out. And then this:
> 
> > +		if (p == pack1) {
> > +			if (pack1_seen)
> > +				continue;
> > +			offset = *found_offset;
> > +		}
> > +		else {
> > +			offset = find_pack_entry_one(sha1, p);
> > +		}
> 
> is meant to make that one-time through the loop cheaper. So I don't
> think it's wrong, but it's very confusing to me.
> 
> Would it be simpler to stick that logic in a function like:
> 
>   static int want_found_object(int exclude, struct packed_git *p)
>   {
> 	if (exclude)
> 		return 1;
> 	if (incremental)
> 		return 0;
> 
> 	/* if we can break early, then do so */
> 	if (!ignore_packed_keep &&
> 	    (!local || !have_non_local_packs))
> 		return 1;
> 
> 	if (local && !p->pack_local)
> 		return 0;
> 	if (ignore_packed_keep && p->pack_local && p->pack_keep)
> 		return 0;
> 
> 	/* indeterminate; keep looking for more packs */
> 	return -1;
>   }
> 
>   static int want_object_in_pack(...)
>   {
> 	...
> 	if (!exclude && local && has_loose_object_nonlocal(sha1))
> 		return 0;
> 
> 	if (*found_pack) {
> 		int ret = want_found_object(exclude, *found_pack);
> 		if (ret != -1)
> 			return ret;
> 	}
> 
> 	for (p = packed_git; p; p = p->next) {
> 		off_t offset;
> 
> 		if (p == *found_pack)
> 			offset = *found_offset;
> 		else
> 			offset = find_pack_entry(sha1, p);
> 		if (offset) {
> 			... fill in *found_pack ...
> 			int ret = want_found_object(exclude, p);
> 			if (ret != -1)
> 				return ret;
> 		}
> 	}
> 	return 1;
>   }
> 
> That's a little more verbose, but IMHO the flow is a lot easier to
> follow (especially as the later re-rolls of that series actually muck
> with the loop order more, but with this approach there's no conflict).

On Mon, Aug 08, 2016 at 09:08:51AM -0700, Junio C Hamano wrote:
> I agree; Kirill's version was so confusing that I couldn't see what
> it was trying to do with "pack1_seen" flag that is reset every time
> loop repeats (at least, before I got my coffee ;-).  A helper function
> like the above makes the logic a lot easier to grasp.

Ok, at least I set today's record for the most confusing code. I agree
with your comments: it is better to simplify the control-flow logic. Somehow
my head refused to do that and insisted on keeping the inside of the loop
intact. Maybe I should get a bit of rest... Scratch all that in favour
of want_found_object(), and thanks for the heads-up.



> >  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> >  			    const char *name, int exclude)
> >  {
> > -	struct packed_git *found_pack;
> > -	off_t found_offset;
> > +	struct packed_git *found_pack = NULL;
> > +	off_t found_offset = 0;
> >  	uint32_t index_pos;
> >  
> >  	if (have_duplicate_entry(sha1, exclude, &index_pos))
> > @@ -1073,6 +1088,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> >  		return 0;
> >  
> > +	if (!want_object_in_pack(sha1, 0, &pack, &offset))
> > +		return 0;
> > +
> 
> This part looks correct and easy to understand.

thanks.


> > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> > index 3893afd..1a61de4 100755
> > --- a/t/t5310-pack-bitmaps.sh
> > +++ b/t/t5310-pack-bitmaps.sh
> > @@ -16,6 +16,7 @@ test_expect_success 'setup repo with moderate-sized history' '
> >  		test_commit side-$i
> >  	done &&
> >  	git checkout master &&
> > +	bitmaptip=$(git show-ref -s master) &&
> 
> Our usual method for getting a sha1 is "git rev-parse". I don't think
> there's anything wrong with your method, but it might be better to stick
> to the canonical one (I had to actually look up "show-ref -s").

ok.


> > @@ -118,6 +119,90 @@ test_expect_success 'incremental repack can disable bitmaps' '
> >  	git repack -d --no-write-bitmap-index
> >  '
> >  
> > +test_expect_success 'pack-objects respects --local (non-local loose)' '
> > +	mkdir -p alt_objects/pack &&
> > +	echo $(pwd)/alt_objects > .git/objects/info/alternates &&
> > +	echo content1 > file1 &&
> 
> Style: we don't put a space between ">" and the filename.

ok, corrected.


> > +	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
> > +	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
> 
> I'm not sure why we need two objects in the fake alt_objects repository.
> Shouldn't one be enough to do the test?

Those two objects are different: one is not present in the main bitmapped
pack and the other is present in it. So the second one tests for the case
Junio caught: when the bitmapped pack overlaps with a non-local loose
object, with --local we want to keep that object out of the resultant
pack. I've adjusted the patch as

+       # non-local loose object which is not present in bitmapped pack
        objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+       # non-local loose object which is also present in bitmapped pack
        git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&

> > +	git add file1 &&
> 
> I think this will actually skip the writing of the loose object, because
> it's already available in the alternate object store. You probably want
> to do this before adding it there.

It does not want to add the object to local objects - it just wants to
make a commit with reference to that object, so that $objsha1 we added
above becomes referenced from HEAD and thus should be put in pack
without --local (and with --local we test it is not put there).

> > +	test_tick &&
> > +	git commit -m commit_file1 &&
> > +	echo HEAD | \
> 
> No need for "\" after a "|"; the shell knows it has to keep looking.

Ok, thanks for the info. I've actually now folded those two lines into
one as it is not long.

> 
> > +	git pack-objects --local --stdout --revs >1.pack &&
> > +	git index-pack 1.pack &&
> 
> I'd have expected you to use the non-stdout version here. Is this meant
> to be independent of your other patch (I think that's OK).

On Mon, Aug 08, 2016 at 09:50:20AM -0400, Jeff King wrote:
> Oh, nevermind, I forgot this was meant to be a preparatory patch. So it
> makes sense to use --stdout in the tests.

Actually now these two patches:

    - to teach bitmapped pack-objects  about --local & friends, and
    - to teach `pack-objects file` to use bitmaps

are completely separated and orthogonal.

I mean they work independently and can be reviewed / applied
independently, each solving its own task. Initially I was keeping them
together because in the first version of `pack-objects file` the default was
to always use bitmap index, and since repack was using it and there were
tests for repack vs. non-local objects, those tests were failing.

Now, since we figured we should have use_bitmap_index=0 by default when
packing to file, the `bitmap + --local` part is not needed for the first
patch. (It is still good to have the `bitmap + --local` part applied, because
it restores correctness and consistency, and it leaves the door open for
brave souls to maybe do repacking with the bitmap index turned on.)

For the current patch I think using --stdout in tests is ok as we know
--stdout uses bitmap indices by default.
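
For illustration, the default selection described above could be sketched
roughly as follows. This is only a shell model of the decision, written on
the assumption that it mirrors the use_bitmap_index handling in the
companion `pack-objects file` patch; resolve_use_bitmap and its argument
order are hypothetical and exist nowhere in git.

```shell
# resolve_use_bitmap CLI CONFIG TO_STDOUT REVS WRITE_BITMAP SHALLOW
# (hypothetical helper, for illustration only)
#   CLI:          -1 = option not given, 0 = --no-use-bitmap-index,
#                  1 = --use-bitmap-index
#   CONFIG:       value of pack.useBitmaps (0 or 1)
#   TO_STDOUT:    1 if packing to stdout, 0 if packing to a file
#   REVS:         1 if the internal rev-list is used (--revs and friends)
#   WRITE_BITMAP: 1 if a bitmap is to be written for the resulting pack
#   SHALLOW:      1 if the repository is shallow
# Prints 1 if the bitmap index would be used, 0 otherwise.
resolve_use_bitmap () {
	cli=$1 config=$2 to_stdout=$3 revs=$4 write_bitmap=$5 shallow=$6

	# "soft" reason: when packing to a file, the default is off
	default=$config
	test "$to_stdout" = 1 || default=0

	# an explicit command-line choice overrides the default
	use=$cli
	test "$use" -ge 0 || use=$default

	# "hard" reasons: bitmaps cannot work at all in these cases
	if test "$revs" = 0 ||
	   { test "$to_stdout" = 0 && test "$write_bitmap" = 1; } ||
	   test "$shallow" = 1
	then
		use=0
	fi
	echo "$use"
}
```

With this model, `resolve_use_bitmap -1 1 0 1 0 0` (packing to a file, nothing
requested explicitly) prints 0, while `resolve_use_bitmap 1 1 0 1 0 0`
(explicit --use-bitmap-index) prints 1.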

> > +	git verify-pack -v 1.pack >1.objects &&
> 
> It's cheaper to use "git show-index <1.pack", and the output is saner,
> too.

I've copied those verify-packs from t7700-repack.sh, but it is ok to switch
to show-index. Thanks for pointing this out.

> > +	echo -e "$objsha1\n$blob" >nonlocal-loose &&
> 
> "echo -e" isn't portable. You can use "printf", or two echos like:
> 
>   {
>     echo one &&
>     echo two
>   } >file

ok, switching to printf. Thanks for portability hint.
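
As a minimal sketch of the printf approach: passing the values as arguments
to a fixed '%s\n' format keeps them out of the format string, so any "%" or
backslash in a value is printed literally. For hex object names this makes
no practical difference; write_lines_sketch is a hypothetical name used only
for this illustration.

```shell
# write_lines_sketch ARG...
# (hypothetical helper) print each argument on its own line; the
# arguments are never interpreted as a printf format
write_lines_sketch () {
	printf '%s\n' "$@"
}
```

In the test above it could be used as
`write_lines_sketch "$objsha1" "$blob" >nonlocal-loose`.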

> (though I'm still not sure what we gain by checking both).

Please see above: those two objects serve to test two different
scenarios: 1) a non-local object which is not in the bitmapped pack,
and 2) a non-local object which is also present in the bitmapped pack.


> > +	if grep -qFf nonlocal-loose 1.objects; then
> > +		echo "Non-local object present in pack generated with --local"
> > +		return 1
> > +	fi
> > +'
> 
> grep -f isn't portable. However, I think:
> 
>   echo $objsha1 >expect &&
>   git show-index <1.pack | cut -d' ' -f2 >actual
>   test_cmp expect actual
> 
> would work (if you do stick with two entries, you might need to sort
> your "expect").

Thanks for pointing out that grep -f is not portable (I did not know, nor
cared, about portability). However here and in similar places we are
checking that entries in nonlocal-loose are not present in 1.objects,
and that is not what test_cmp does: it would test whether nonlocal-loose
and 1.objects are exactly the same.

For this reason I'm changing `grep -f ...` to `git grep --no-index -f ...`
which we carry with us.
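
To make the difference from test_cmp concrete, here is a sketch of the same
"does any entry overlap" check that avoids `-f` altogether. It is not what
the patch uses (the patch uses `git grep --no-index`); hasany_portable is a
hypothetical name, and the sketch assumes a grep that supports -q, -x, -F
and -e, which POSIX specifies but which has not been checked here on every
platform git targets.

```shell
# hasany_portable PATTERN-FILE CONTENT-FILE
# (hypothetical helper) succeeds if at least one whole line of
# PATTERN-FILE also appears as a whole line in CONTENT-FILE
hasany_portable () {
	while IFS= read -r line
	do
		# -F: fixed string, -x: match the whole line, -q: no output
		if grep -qxF -e "$line" "$2"
		then
			return 0
		fi
	done <"$1"
	return 1
}
```

Unlike test_cmp, this succeeds as soon as a single line is shared, and it
fails only when the two files have no line in common.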

> 
> I think similar comments apply to the other tests.

I went through all tests in the patch and made similar adjustments
everywhere.

> I would have expected
> "respects --local (non-local pack)" to come next (i.e., to keep all of
> the --local tests together). But you seem to interleave them with
> --honor-pack-keep.

There is a reason: "respects --local (non-local pack)" needs a non-local
pack set up in alt_objects/ for its tests to run. And we set up such a pack
as a byproduct of running "pack-objects respects --honor-pack-keep (local
non-bitmapped pack)". Similarly "pack-objects respects --local
(non-local bitmapped pack)" needs for its testing (and moves to
alt_objects/) the main bitmapped pack, which was just analyzed in
"pack-objects respects --honor-pack-keep (local bitmapped pack)".

Initially I tried to cluster tests (i.e. all --local together and then
all --honor-pack-keep together) but having tests interleaved turned out
to be handy because one step checks something and prepares setup for its
next one.

On Mon, Aug 08, 2016 at 09:11:53AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> >> > ...
> >> > Suggested-by: Junio C Hamano <gitster@pobox.com>
> >> > Discussed-with: Jeff King <peff@peff.net>
> >> > ---
> >> 
> >> I do not think I suggested much of this to deserve credit like this,
> >> though, as I certainly haven't thought about the pros-and-cons
> >> between adding the same "some object in pack may not want to be in
> >> the output" logic to the bitmap side, or punting the bitmap codepath
> >> when local/keep are involved.
> >
> > I understand. Still for me it was you who convinced me to add proper
> > support for e.g. --local vs bitmap instead of special-casing it.
> 
> OK, in such a case, it probably is more sensible to do it like:
> 
>     ...
>     with all differences strangely showing we are a bit faster now, but
>     probably all being within noise.
> 
>     Credit for inspiring this solution and discussing the design of
>     the change goes to Junio and Jeff King.

Ok, thanks for advice.

> Don't forget your own sign-off ;-)

Oops, thanks for catching :) Obviously I forgot it and now corrected.

Thanks,
Kirill

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH v3] pack-objects: Teach --use-bitmap-index codepath to respect
 --local, --honor-pack-keep and --incremental

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check at all whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath simply has no such checking.

The bitmapped codepath however allowed all those options to be passed
in, with bitmap indices still being used under such conditions,
potentially giving wrong output (e.g. including objects from non-local or
.keep'ed pack).

We can easily fix this by noting the following: when an object comes to
add_object_entry_from_bitmap() it can come for two reasons:

    1. entries coming from main pack covered by bitmap index, and
    2. object coming from, possibly alternate, loose or other packs.

"2" can be already handled by want_object_in_pack() and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning calling client already found object's pack entry.

In want_object_in_pack() we take care to start the checks from the already
found pack, if we have one, this way determining the answer right away
in case neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows, we do not harm
served-with-bitmap clones performance-wise:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
    5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
    5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
    5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%

with all differences strangely showing we are a bit faster now, but
probably all being within noise.

And in the general case we take care not to have duplicate
find_pack_entry_one(*found_pack) calls. The worst that can happen is that
we call want_found_object(*found_pack) -- the newly introduced helper for
checking whether we want an object -- twice, but since want_found_object()
is very lightweight it does not make any difference.

I appreciate the help of Junio C Hamano and Jeff King in discussing this
change.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  |  94 ++++++++++++++++++++++++++--------------
 t/t5310-pack-bitmaps.sh | 111 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..e06c1bf 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -944,13 +944,45 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
+static int want_found_object(int exclude, struct packed_git *p)
+{
+	if (exclude)
+		return 1;
+	if (incremental)
+		return 0;
+
+	/*
+	 * When asked to do --local (do not include an
+	 * object that appears in a pack we borrow
+	 * from elsewhere) or --honor-pack-keep (do not
+	 * include an object that appears in a pack marked
+	 * with .keep), we need to make sure no copy of this
+	 * object come from in _any_ pack that causes us to
+	 * omit it, and need to complete this loop.  When
+	 * neither option is in effect, we know the object
+	 * we just found is going to be packed, so break
+	 * out of the search loop now.
+	 */
+	if (!ignore_packed_keep &&
+	    (!local || !have_non_local_packs))
+		return 1;
+
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	/* we don't know yet; keep looking for more packs */
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * As a side effect of this check, if object's pack entry was not already found,
+ * we will find the packed version of this object, if any. We therefore pass
+ * out the pack information to avoid having to look it up again later.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +990,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	int want;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present we will determine the answer right now.
+	 */
+	if (*found_pack) {
+		want = want_found_object(exclude, *found_pack);
+		if (want != -1)
+			return want;
+	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset;
+
+		if (p == *found_pack)
+			offset = *found_offset;
+		else
+			offset = find_pack_entry_one(sha1, p);
+
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -974,31 +1021,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			if (exclude)
-				return 1;
-			if (incremental)
-				return 0;
-
-			/*
-			 * When asked to do --local (do not include an
-			 * object that appears in a pack we borrow
-			 * from elsewhere) or --honor-pack-keep (do not
-			 * include an object that appears in a pack marked
-			 * with .keep), we need to make sure no copy of this
-			 * object come from in _any_ pack that causes us to
-			 * omit it, and need to complete this loop.  When
-			 * neither option is in effect, we know the object
-			 * we just found is going to be packed, so break
-			 * out of the loop to return 1 now.
-			 */
-			if (!ignore_packed_keep &&
-			    (!local || !have_non_local_packs))
-				break;
-
-			if (local && !p->pack_local)
-				return 0;
-			if (ignore_packed_keep && p->pack_local && p->pack_keep)
-				return 0;
+			want = want_found_object(exclude, p);
+			if (want != -1)
+				return want;
 		}
 	}
 
@@ -1039,8 +1064,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1098,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..e71caa4 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -7,6 +7,19 @@ objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
 
+# show objects present in pack ($1 should be associated *.idx)
+packobjects () {
+	git show-index <$1 | cut -d' ' -f2
+}
+
+# hasany pattern-file content-file
+# tests whether content-file has any entry from pattern-file with entries being
+# whole lines.
+hasany () {
+	# NOTE `grep -f` is not portable
+	git grep --no-index -qFf $1 $2
+}
+
 test_expect_success 'setup repo with moderate-sized history' '
 	for i in $(test_seq 1 10); do
 		test_commit $i
@@ -16,6 +29,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git rev-parse master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +132,86 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	mkdir -p alt_objects/pack &&
+	echo $(pwd)/alt_objects >.git/objects/info/alternates &&
+	echo content1 >file1 &&
+	# non-local loose object which is not present in bitmapped pack
+	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	# non-local loose object which is also present in bitmapped pack
+	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	packobjects 1.idx >1.objects &&
+	printf "$objsha1\n$blob\n" >nonlocal-loose &&
+	if hasany nonlocal-loose 1.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 >file2 &&
+	objsha2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	printf "$objsha2\n$bitmaptip\n" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	touch .git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $objsha2) &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	packobjects 2a.idx >2a.objects &&
+	if hasany keepobjects 2a.objects; then
+		echo "Object from .keeped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	packobjects 2b.idx >2b.objects &&
+	if hasany keepobjects 2b.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	packobjects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	touch .git/objects/pack/$packbitmap.keep &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	packobjects 3a.idx >3a.objects &&
+	if hasany packbitmap.objects 3a.objects; then
+		echo "Object from .keeped bitmapped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi &&
+	rm .git/objects/pack/$packbitmap.keep
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	packobjects 3b.idx >3b.objects &&
+	if hasany packbitmap.objects 3b.objects; then
+		echo "Non-local object from bitmapped pack present in pack generated with --local"
+		return 1
+	fi &&
+	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +237,23 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	packobjects 4.idx >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	cut -d" " -f1 revlist |sort >objects &&
+	if ! hasany objects 4.objects; then
+		echo "Expected objects not present in incremental pack"
+		return 1
+	fi
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 18:13                             ` Kirill Smelkov
@ 2016-08-08 18:28                               ` Junio C Hamano
  2016-08-08 18:58                                 ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 18:28 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> Another question: I'm preparing another version of "pack-objects: Teach
> --use-bitmap-index codepath to  respect --local ..." and was going to
> put
>
>     ( updated patch is in the end of this mail )
>
> in the top of the message. Is it ok or better not to do so and just respin
> the patch in its own separate mail?

That would force those who pick leftover bits to _open_ and read a
first few lines.

Definitely it is better than burying a patch after 60+ lines, but a
separate patch with incremented "[PATCH v6 1/2]" on the subject line
beats it hands-down from discoverability's point of view.


* [PATCH v5] pack-objects: teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 15:40                         ` Kirill Smelkov
  2016-08-08 18:08                           ` Junio C Hamano
@ 2016-08-08 18:55                           ` Kirill Smelkov
  2016-08-08 20:53                             ` Junio C Hamano
  1 sibling, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 18:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That however was done only for
the case when the resultant pack is sent to stdout, not written into a file.

The reason is that for on-disk repack we want, by default:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we care it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

This way for pack-objects -> file we get nice speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects not to use the bitmap
index: tools which need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so this way we can be sure regular on-disk repacking
remains robust.

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://marc.info/?t=146792101400001&r=1&w=2

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 Documentation/config.txt |  3 +++
 builtin/pack-objects.c   | 31 ++++++++++++++++++++++++-------
 t/t5310-pack-bitmaps.sh  | 12 ++++++++++++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index bc1c433..4ba0c4a 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2244,6 +2244,9 @@ pack.useBitmaps::
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
++
+*NOTE*: when packing to file (e.g., on repack) the default is always not to use
+	pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 92e2e5f..0a89e8d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2226,7 +2227,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2475,13 +2476,13 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 }
 
 /*
- * This tracks any options which a reader of the pack might
- * not understand, and which would therefore prevent blind reuse
- * of what we have on disk.
+ * This tracks any options which pack-reuse code expects to be on, or which a
+ * reader of the pack might not understand, and which would therefore prevent
+ * blind reuse of what we have on disk.
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_ofs_delta;
+	return pack_to_stdout && allow_ofs_delta;
 }
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
@@ -2774,7 +2775,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..ffecc6a 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,18 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git show-index <packa-$packasha1.idx | cut -d" " -f2 >packa.objects &&
+	git show-index <packb-$packbsha1.idx | cut -d" " -f2 >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.2.701.gf965a18.dirty


* [PATCH v3] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 18:19                             ` Kirill Smelkov
@ 2016-08-08 18:57                               ` Kirill Smelkov
  2016-08-08 19:26                               ` [PATCH 1/2] " Junio C Hamano
  1 sibling, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 18:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check at all whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath simply has no such checking.

The bitmapped codepath however allowed all those options to be passed
in, with bitmap indices still being used under such conditions,
potentially giving wrong output (e.g. including objects from non-local or
.keep'ed pack).

We can easily fix this by noting the following: an object can reach
add_object_entry_from_bitmap() from two sources:

    1. the main pack covered by the bitmap index, or
    2. loose objects or other packs, possibly in an alternate.

"2" can already be handled by want_object_in_pack(), and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning the caller has already found the object's pack entry.

In want_object_in_pack() we take care to start the checks from the
already found pack, if we have one, so that the answer is determined
right away when neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows, we do not harm the
performance of clones served with bitmaps:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
    5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
    5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
    5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%

All differences, strangely, show we are a bit faster now, but they are
probably within noise.

In the general case we take care not to make duplicate
find_pack_entry_one(*found_pack) calls. The worst that can happen is that
we call want_found_object(*found_pack) -- a newly introduced helper for
checking whether we want an object -- twice, but since want_found_object()
is very lightweight this makes no difference.

I appreciate the help of Junio C Hamano and Jeff King in discussing this
change.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  |  94 ++++++++++++++++++++++++++--------------
 t/t5310-pack-bitmaps.sh | 111 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..e06c1bf 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -944,13 +944,45 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
+static int want_found_object(int exclude, struct packed_git *p)
+{
+	if (exclude)
+		return 1;
+	if (incremental)
+		return 0;
+
+	/*
+	 * When asked to do --local (do not include an
+	 * object that appears in a pack we borrow
+	 * from elsewhere) or --honor-pack-keep (do not
+	 * include an object that appears in a pack marked
+	 * with .keep), we need to make sure no copy of this
+	 * object come from in _any_ pack that causes us to
+	 * omit it, and need to complete this loop.  When
+	 * neither option is in effect, we know the object
+	 * we just found is going to be packed, so break
+	 * out of the search loop now.
+	 */
+	if (!ignore_packed_keep &&
+	    (!local || !have_non_local_packs))
+		return 1;
+
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	/* we don't know yet; keep looking for more packs */
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * As a side effect of this check, if object's pack entry was not already found,
+ * we will find the packed version of this object, if any. We therefore pass
+ * out the pack information to avoid having to look it up again later.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +990,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	int want;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present we will determine the answer right now.
+	 */
+	if (*found_pack) {
+		want = want_found_object(exclude, *found_pack);
+		if (want != -1)
+			return want;
+	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset;
+
+		if (p == *found_pack)
+			offset = *found_offset;
+		else
+			offset = find_pack_entry_one(sha1, p);
+
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -974,31 +1021,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			if (exclude)
-				return 1;
-			if (incremental)
-				return 0;
-
-			/*
-			 * When asked to do --local (do not include an
-			 * object that appears in a pack we borrow
-			 * from elsewhere) or --honor-pack-keep (do not
-			 * include an object that appears in a pack marked
-			 * with .keep), we need to make sure no copy of this
-			 * object come from in _any_ pack that causes us to
-			 * omit it, and need to complete this loop.  When
-			 * neither option is in effect, we know the object
-			 * we just found is going to be packed, so break
-			 * out of the loop to return 1 now.
-			 */
-			if (!ignore_packed_keep &&
-			    (!local || !have_non_local_packs))
-				break;
-
-			if (local && !p->pack_local)
-				return 0;
-			if (ignore_packed_keep && p->pack_local && p->pack_keep)
-				return 0;
+			want = want_found_object(exclude, p);
+			if (want != -1)
+				return want;
 		}
 	}
 
@@ -1039,8 +1064,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1098,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..e71caa4 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -7,6 +7,19 @@ objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
 
+# show objects present in pack ($1 should be associated *.idx)
+packobjects () {
+	git show-index <$1 | cut -d' ' -f2
+}
+
+# hasany pattern-file content-file
+# tests whether content-file has any entry from pattern-file with entries being
+# whole lines.
+hasany () {
+	# NOTE `grep -f` is not portable
+	git grep --no-index -qFf $1 $2
+}
+
 test_expect_success 'setup repo with moderate-sized history' '
 	for i in $(test_seq 1 10); do
 		test_commit $i
@@ -16,6 +29,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git rev-parse master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +132,86 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	mkdir -p alt_objects/pack &&
+	echo $(pwd)/alt_objects >.git/objects/info/alternates &&
+	echo content1 >file1 &&
+	# non-local loose object which is not present in bitmapped pack
+	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	# non-local loose object which is also present in bitmapped pack
+	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	packobjects 1.idx >1.objects &&
+	printf "$objsha1\n$blob\n" >nonlocal-loose &&
+	if hasany nonlocal-loose 1.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 >file2 &&
+	objsha2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	printf "$objsha2\n$bitmaptip\n" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	touch .git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $objsha2) &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	packobjects 2a.idx >2a.objects &&
+	if hasany keepobjects 2a.objects; then
+		echo "Object from .keeped pack present in pack generated with --honor-pack-keep"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	packobjects 2b.idx >2b.objects &&
+	if hasany keepobjects 2b.objects; then
+		echo "Non-local object present in pack generated with --local"
+		return 1
+	fi
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	packobjects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	touch .git/objects/pack/$packbitmap.keep &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	packobjects 3a.idx >3a.objects &&
+	if hasany packbitmap.objects 3a.objects; then
+		echo "Object from .keeped bitmapped pack present in pack generated with --honour-pack-keep"
+		return 1
+	fi &&
+	rm .git/objects/pack/$packbitmap.keep
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	packobjects 3b.idx >3b.objects &&
+	if hasany packbitmap.objects 3b.objects; then
+		echo "Non-local object from bitmapped pack present in pack generated with --local"
+		return 1
+	fi &&
+	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +237,23 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	packobjects 4.idx >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	cut -d" " -f1 revlist |sort >objects &&
+	if ! hasany objects 4.objects; then
+		echo "Expected objects not present in incremental pack"
+		return 1
+	fi
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 18:28                               ` Junio C Hamano
@ 2016-08-08 18:58                                 ` Kirill Smelkov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-08 18:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 11:28:02AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > Another question: I'm preparing another version of "pack-objects: Teach
> > --use-bitmap-index codepath to  respect --local ..." and was going to
> > put
> >
> >     ( updated patch is in the end of this mail )
> >
> > in the top of the message. Is it ok or better not to do so and just respin
> > the patch in its own separate mail?
> 
> That would force those who pick leftover bits to _open_ and read a
> first few lines.
> 
> Definitely it is better than burying a patch after 60+ lines, but a
> separate patch with incremented "[PATCH v6 1/2]" on the subject line
> beats it hands-down from discoverability's point of view.

Thanks, I see. I've resent both patches as separate mails.


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 13:50                           ` Jeff King
  2016-08-08 13:51                             ` Jeff King
  2016-08-08 16:08                             ` Junio C Hamano
@ 2016-08-08 19:06                             ` Junio C Hamano
  2016-08-08 19:09                               ` Jeff King
  2 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 19:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Kirill Smelkov, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Jeff King <peff@peff.net> writes:

>> +	if grep -qFf nonlocal-loose 1.objects; then
>> +		echo "Non-local object present in pack generated with --local"
>> +		return 1
>> +	fi
>> +'
>
> grep -f isn't portable. However, I think:
>
>   echo $objsha1 >expect &&
>   git show-index <1.pack | cut -d' ' -f2 >actual
>   test_cmp expect actual
>
> would work (if you do stick with two entries, you might need to sort
> your "expect").

Hmph, are you sure?  "grep -f pattern_file" is in POSIX.1.



* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 19:06                             ` Junio C Hamano
@ 2016-08-08 19:09                               ` Jeff King
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff King @ 2016-08-08 19:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Kirill Smelkov, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 12:06:13PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> +	if grep -qFf nonlocal-loose 1.objects; then
> >> +		echo "Non-local object present in pack generated with --local"
> >> +		return 1
> >> +	fi
> >> +'
> >
> > grep -f isn't portable. However, I think:
> >
> >   echo $objsha1 >expect &&
> >   git show-index <1.pack | cut -d' ' -f2 >actual
> >   test_cmp expect actual
> >
> > would work (if you do stick with two entries, you might need to sort
> > your "expect").
> 
> Hmph, are you sure?  "grep -f pattern_file" is in POSIX.1.

Hmm, you're right. I specifically checked my local grep.1posix manpage,
but searching for "-f" didn't turn up anything, because it's formatted
with a Unicode minus sign (U+2212). Bleh.

-Peff


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 18:19                             ` Kirill Smelkov
  2016-08-08 18:57                               ` [PATCH v3] " Kirill Smelkov
@ 2016-08-08 19:26                               ` Junio C Hamano
  2016-08-09 11:21                                 ` Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 19:26 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> ---- 8< ----
> From: Kirill Smelkov <kirr@nexedi.com>
> Subject: [PATCH v3] pack-objects: Teach --use-bitmap-index codepath to respect
>  --local, --honor-pack-keep and --incremental

(Not a question to Kirill)

Hmph.  I suspect that the handling of in-body headers by mailinfo is
not prepared to see RFC2822 header folding.  "am -c" gives a single-line
subject with " --local ..." as its first line in the body.

I'll leave it as a low-hanging fruit for somebody to fix ;-)

	Subject: pack-objects: respect --local, etc. when bitmap is in use

might be shorter and more to the point, anyway.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index c4c2a3c..e06c1bf 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -944,13 +944,45 @@ static int have_duplicate_entry(const unsigned char *sha1,
>  	return 1;
>  }
>  
> +static int want_found_object(int exclude, struct packed_git *p)
> +{
> +	if (exclude)
> +		return 1;
> +	if (incremental)
> +		return 0;
> +
> +	/*
> +	 * When asked to do --local (do not include an
> +	 * object that appears in a pack we borrow
> +	 * from elsewhere) or --honor-pack-keep (do not
> +	 * include an object that appears in a pack marked
> +	 * with .keep), we need to make sure no copy of this
> +	 * object come from in _any_ pack that causes us to
> +	 * omit it, and need to complete this loop.  When
> +	 * neither option is in effect, we know the object
> +	 * we just found is going to be packed, so break
> +	 * out of the search loop now.
> +	 */

The blame is mine, but "no copy of this object appears in _any_ pack"
would be more correct and easier to read.

This code is no longer in a search loop; its caller is.  Further
rephrasing is needed.  "When asked to do ...these things..., finding
a pack that matches the criteria is sufficient for us to decide to
omit it.  However, even if this pack does not satisfy the criteria,
we need to make sure no copy of this object appears in _any_ pack
that makes us omit the object, so we need to check all the packs.
Signal that by returning -1 to the caller." or something along that
line.

>  /*
>   * Check whether we want the object in the pack (e.g., we do not want
>   * objects found in non-local stores if the "--local" option was used).
>   *
> - * As a side effect of this check, we will find the packed version of this
> - * object, if any. We therefore pass out the pack information to avoid having
> - * to look it up again later.
> + * As a side effect of this check, if object's pack entry was not already found,
> + * we will find the packed version of this object, if any. We therefore pass
> + * out the pack information to avoid having to look it up again later.

The reasoning leading to "We therefore" is understandable, but "pass
out the pack information" is not quite.  Is this meant to explain
the fact that *found_pack and *found_offset are in-out parameters?

The explanation to justify why *found_pack and *found_offset that
used to be out parameters are made in-out parameters belongs to the
log message.  We do not want this in-code comment to explain the
updated code relative to what the code used to do; that is not
useful to those who read the code for the first time in the context
of the committed state.

        /* 
         * Check whether we want to pack the object in the pack (e.g. ...).
         *
         * If the caller already knows an existing pack it wants to
         * take the object from, that is passed in *found_pack and
         * *found_offset; otherwise this function finds if there is
         * any pack that has the object and returns the pack and its
         * offset in these variables.
         */

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index 3893afd..e71caa4 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -7,6 +7,19 @@ objpath () {
>  	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
>  }
>  
> +# show objects present in pack ($1 should be associated *.idx)
> +packobjects () {
> +	git show-index <$1 | cut -d' ' -f2
> +}

That is a misleading name for a helper function that produces a list
of objects that were packed.  "list_packed_objects", perhaps.

> +# hasany pattern-file content-file
> +# tests whether content-file has any entry from pattern-file with entries being
> +# whole lines.
> +hasany () {
> +	# NOTE `grep -f` is not portable
> +	git grep --no-index -qFf $1 $2
> +}

I doubt "grep -f pattern_file" is not portable, but in any case, it
is probably a good idea to have this helper function to make the
caller easier to read.  Please name it "has_any", though, and quote
"$1" and "$2" as they are meant to be able to take any filename.

> +test_expect_success 'pack-objects respects --local (non-local loose)' '
> +	mkdir -p alt_objects/pack &&

I'd really really prefer to see an empty repository created for
this.  Even though the original intent was .git/objects/ alone,
i.e. GIT_OBJECT_DIRECTORY can exist without associated refs, we
discovered that it is in general not a good idea (think: "gc").

> +	echo $(pwd)/alt_objects >.git/objects/info/alternates &&
> +	echo content1 >file1 &&
> +	# non-local loose object which is not present in bitmapped pack
> +	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&

Don't say "sha" when you mean "object name".  Otherwise you would
end up introducing funky variable names like $objsha2 we see below
that is confusing (we don't use SHA-2).

> +	# non-local loose object which is also present in bitmapped pack
> +	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
> +	git add file1 &&
> +	test_tick &&
> +	git commit -m commit_file1 &&
> +	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
> +	git index-pack 1.pack &&
> +	packobjects 1.idx >1.objects &&
> +	printf "$objsha1\n$blob\n" >nonlocal-loose &&

I think Peff meant to suggest this instead:

	printf "%s\n" "$objsha1" "$blob"
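
To illustrate the point (a sketch with made-up values, not taken from
the patch): printf reuses the "%s\n" format for every argument, so each
value lands on its own line and nothing inside the values is treated as
a format directive or an escape sequence.

```shell
# Hypothetical values; real object names are plain hex. These are chosen
# to contain characters that would be misread inside a format string.
first='100%'
second='a\tb'

# Data passed as arguments: printed verbatim, one value per line.
safe=$(printf "%s\n" "$first" "$second")
printf "%s\n" "$safe"
```

With hex object names both spellings give the same output; the
argument form simply stays correct if the data ever changes shape.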

> +	if hasany nonlocal-loose 1.objects; then
> +		echo "Non-local object present in pack generated with --local"
> +		return 1
> +	fi

Just saying

	! has_any nonlocal-loose 1.objects

is sufficient.  The same comment applies to all other uses of this
verbose output.

Besides, we spell "if/then/fi" like this:

	if condition
	then
		body
	fi

without a semicolon.

> +test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
> +...
> +	touch .git/objects/pack/pack2-$pack2.keep &&

Please don't do "touch" _unless_ you care about the timestamp of the
file.  Redirect an empty command into it, i.e.

	>.git/objects/pack/pack2-$pack2.keep

or

	echo "reason to keep it" >.git/objects/pack/pack2-$pack2.keep

instead.
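
A small sketch of the two alternatives, using a scratch directory and a
made-up file name rather than a real pack (both are assumptions for the
example): the bare redirection creates an empty marker, and the echo
variant records why the pack is kept.

```shell
tmp=$(mktemp -d) &&
keep="$tmp/pack-example.keep" &&	# hypothetical name, not a real pack

# Create an empty marker file without touch.
>"$keep" &&
test -f "$keep" &&
test ! -s "$keep" &&
echo "empty marker created"

# Or record a reason in the file.
echo "reason to keep it" >"$keep" &&
cat "$keep"
```

Note that, unlike touch, the redirection truncates a file that already
exists, which is harmless for a marker like this one.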

> +test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
> +	ls .git/objects/pack/ | grep bitmap >output &&
> +	test_line_count = 1 output &&
> +	packbitmap=$(basename $(cat output) .bitmap) &&
> +	packobjects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
> +	touch .git/objects/pack/$packbitmap.keep &&
> +	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
> +	git index-pack 3a.pack &&
> +	packobjects 3a.idx >3a.objects &&
> +	if hasany packbitmap.objects 3a.objects; then
> +		echo "Object from .keeped bitmapped pack present in pack generated with --honour-pack-keep"
> +		return 1
> +	fi &&
> +	rm .git/objects/pack/$packbitmap.keep

Arrange this removal to happen even when any earlier step fails, so
that later tests will not get affected by stray existence of this
file, by using test_when_finished.  E.g.

	list_packed_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
	>.git/objects/pack/$packbitmap.keep &&
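
test_when_finished belongs to Git's test harness, so it cannot be shown
standalone.  As a rough analogy only (an assumption for illustration,
not how the harness is implemented), the same idea of cleaning up even
when a step fails can be modeled in plain shell with an EXIT trap in a
subshell.  The file name below is made up.

```shell
tmp=$(mktemp -d) &&
keep="$tmp/pack-example.keep"	# hypothetical file name

(
	# Registered before the file is created, so it runs however we leave.
	trap 'rm -f "$keep"' EXIT
	>"$keep" &&
	false &&	# simulated failing step
	echo "later steps are skipped"
) || echo "a step failed"

if test -e "$keep"
then
	echo "stray .keep left behind"
else
	echo ".keep removed despite the failure"
fi
```

The important part is the ordering: the cleanup is registered before
the file is created, so no failure in between can leave it behind.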

> +test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
> +	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
> +	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
> +	git index-pack 3b.pack &&
> +	packobjects 3b.idx >3b.objects &&
> +	if hasany packbitmap.objects 3b.objects; then
> +		echo "Non-local object from bitmapped pack present in pack generated with --local"
> +		return 1
> +	fi &&
> +	mv alt_objects/pack/$packbitmap.* .git/objects/pack/

Ditto on potential use of test_when_finished.



* Re: [PATCH v5] pack-objects: teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 18:55                           ` [PATCH v5] pack-objects: teach " Kirill Smelkov
@ 2016-08-08 20:53                             ` Junio C Hamano
  2016-08-09 11:21                               ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-08 20:53 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index bc1c433..4ba0c4a 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -2244,6 +2244,9 @@ pack.useBitmaps::
>  	to stdout (e.g., during the server side of a fetch). Defaults to
>  	true. You should not generally need to turn this off unless
>  	you are debugging pack bitmaps.
> ++
> +*NOTE*: when packing to file (e.g., on repack) the default is always not to use
> +	pack bitmaps.

This is a bit hard to read and understand.

The patched result starts with "When true, git will use bitmap when
packing to stdout", i.e. when packing to file, git will not.  So
this *NOTE* is repeating the same thing.  The reader is made to
wonder "Why does it need to repeat the same thing?  Does this mean
when the variable is set, a pack sent to a disk uses the bitmap?"

I think what you actually do in the code is to make the variable
affect _only_ the standard-output case, and users need a command
line option if they want to use bitmap when writing to a file (the
code to do so looks correctly done).

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index 3893afd..ffecc6a 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -118,6 +118,18 @@ test_expect_success 'incremental repack can disable bitmaps' '
>  	git repack -d --no-write-bitmap-index
>  '
>  
> +test_expect_success 'pack-objects to file can use bitmap' '
> +	# make sure we still have 1 bitmap index from previous tests
> +	ls .git/objects/pack/ | grep bitmap >output &&
> +	test_line_count = 1 output &&
> +	# verify equivalent packs are generated with/without using bitmap index
> +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> +	git show-index <packa-$packasha1.idx | cut -d" " -f2 >packa.objects &&
> +	git show-index <packb-$packbsha1.idx | cut -d" " -f2 >packb.objects &&
> +	test_cmp packa.objects packb.objects
> +'

Looks good.


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-08 19:26                               ` [PATCH 1/2] " Junio C Hamano
@ 2016-08-09 11:21                                 ` Kirill Smelkov
  2016-08-09 11:25                                   ` [PATCH 1/2 v4] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
  2016-08-09 16:52                                   ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 11:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Junio, first of all thanks for feedback,

On Mon, Aug 08, 2016 at 12:26:33PM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
[...]
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index c4c2a3c..e06c1bf 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -944,13 +944,45 @@ static int have_duplicate_entry(const unsigned char *sha1,
> >  	return 1;
> >  }
> >  
> > +static int want_found_object(int exclude, struct packed_git *p)
> > +{
> > +	if (exclude)
> > +		return 1;
> > +	if (incremental)
> > +		return 0;
> > +
> > +	/*
> > +	 * When asked to do --local (do not include an
> > +	 * object that appears in a pack we borrow
> > +	 * from elsewhere) or --honor-pack-keep (do not
> > +	 * include an object that appears in a pack marked
> > +	 * with .keep), we need to make sure no copy of this
> > +	 * object come from in _any_ pack that causes us to
> > +	 * omit it, and need to complete this loop.  When
> > +	 * neither option is in effect, we know the object
> > +	 * we just found is going to be packed, so break
> > +	 * out of the search loop now.
> > +	 */
> 
> The blame is mine, but "no copy of this object appears in _any_ pack"
> would be more correct and easier to read.
> 
> This code is no longer in a search loop; its caller is.  Further
> rephrasing is needed.  "When asked to do ...these things..., finding
> a pack that matches the criteria is sufficient for us to decide to
> omit it.  However, even if this pack does not satisfy the criteria,
> we need to make sure no copy of this object appears in _any_ pack
> that makes us omit the object, so we need to check all the packs.
> Signal that by returning -1 to the caller." or something along that
> line.

Ok, I've rephrased it your way. Thanks for advising.

> >  /*
> >   * Check whether we want the object in the pack (e.g., we do not want
> >   * objects found in non-local stores if the "--local" option was used).
> >   *
> > - * As a side effect of this check, we will find the packed version of this
> > - * object, if any. We therefore pass out the pack information to avoid having
> > - * to look it up again later.
> > + * As a side effect of this check, if object's pack entry was not already found,
> > + * we will find the packed version of this object, if any. We therefore pass
> > + * out the pack information to avoid having to look it up again later.
> 
> The reasoning leading to "We therefore" is understandable, but "pass
> out the pack information" is not quite.  Is this meant to explain
> the fact that *found_pack and *found_offset are in-out parameters?
> 
> The explanation to justify why *found_pack and *found_offset that
> used to be out parameters are made in-out parameters belongs to the
> log message.  We do not want this in-code comment to explain the
> updated code relative to what the code used to do; that is not
> useful to those who read the code for the first time in the context
> of the committed state.
> 
>         /* 
>          * Check whether we want to pack the object in the pack (e.g. ...).
>          *
>          * If the caller already knows an existing pack it wants to
>          * take the object from, that is passed in *found_pack and
>          * *found_offset; otherwise this function finds if there is
>          * any pack that has the object and returns the pack and its
>          * offset in these variables.
>          */

The "pass out the pack information ..." wording is not my text - I only added
"if object's pack entry was not already found" in the middle of the
sentence and rewrapped the paragraph. The "pass out the pack
information ..." part comes from ce2bc424 (pack-objects: split
add_object_entry; 2013-12-21).

I agree your text is clearer and it is better to adjust the comment.

> > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> > index 3893afd..e71caa4 100755
> > --- a/t/t5310-pack-bitmaps.sh
> > +++ b/t/t5310-pack-bitmaps.sh
> > @@ -7,6 +7,19 @@ objpath () {
> >  	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
> >  }
> >  
> > +# show objects present in pack ($1 should be associated *.idx)
> > +packobjects () {
> > +	git show-index <$1 | cut -d' ' -f2
> > +}
> 
> That is a misleading name for a helper function that produces a list
> of objects that were packed.  "list_packed_objects", perhaps.

I agree it is ambiguous wrt `git pack-objects`, and sorry for not choosing
a good name from the start. I'm changing it to pack_list_objects().
( personally I would use pack_obj_list a-la git-rev-list, but let's try
  not to create another review step over abbreviating vs not abbreviating )
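
For readers unfamiliar with the output being parsed, here is a small sketch
of what the `cut -d' ' -f2` step in the helper does. The two input lines are
made up and only imitate the "offset object-name (crc)" layout that
`git show-index` prints for a version 2 index; no real pack is involved.

```shell
# Made-up sample imitating "git show-index" output: offset, object name, (crc32).
sample='12 1111111111111111111111111111111111111111 (0a0b0c0d)
345 2222222222222222222222222222222222222222 (01020304)'

# Keep only the second space-separated column, i.e. the object names.
names=$(printf "%s\n" "$sample" | cut -d' ' -f2)
printf "%s\n" "$names"
```

The helper in the patch does the same thing, with the real index file fed to
`git show-index` on stdin.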

> > +# hasany pattern-file content-file
> > +# tests whether content-file has any entry from pattern-file with entries being
> > +# whole lines.
> > +hasany () {
> > +	# NOTE `grep -f` is not portable
> > +	git grep --no-index -qFf $1 $2
> > +}
> 
> I doubt "grep -f pattern_file" is not portable, but in any case, it
> is probably a good idea to have this helper function to make the
> caller easier to read.  Please name it "has_any", though, and quote
> "$1" and "$2" as they are meant to be able to take any filename.

Ok, thanks for the info that `grep -f` is portable. has_any now uses plain
`grep -qFf` with "$1" and "$2" quoted.
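
A minimal stand-alone sketch of the helper with plain grep, outside the test
suite. The file names and sample lines are hypothetical. Note that -F matches
fixed strings anywhere in a line; for full object names compared against
lines that hold only object names this is effectively a whole-line match, and
adding -x would be one way to make that strict (an assumption, the patch does
not use -x).

```shell
# Stand-alone sketch of has_any; file names and contents are hypothetical.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# has_any pattern-file content-file
# succeeds if content-file contains any of the fixed strings in pattern-file
has_any () {
	grep -qFf "$1" "$2"
}

printf "%s\n" aaa bbb >"$tmp/patterns"
printf "%s\n" xxx bbb yyy >"$tmp/hit"
printf "%s\n" xxx yyy >"$tmp/miss"

if has_any "$tmp/patterns" "$tmp/hit"
then
	echo "hit: at least one pattern found"
fi
if ! has_any "$tmp/patterns" "$tmp/miss"
then
	echo "miss: no pattern found"
fi
```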


> > +test_expect_success 'pack-objects respects --local (non-local loose)' '
> > +	mkdir -p alt_objects/pack &&
> 
> I'd really really prefer to see an empty repository created for
> this.  Even though the original intent was .git/objects/ alone,
> i.e. GIT_OBJECT_DIRECTORY can exist without associated refs, we
> discovered that it is in general not a good idea (think: "gc").

The "mkdir alt_objects/" comes from t7700-repack.sh - e.g. from
3c3df429, which I ported from there. However, as you say, let's switch
this to a full separate repository.


> > +	echo $(pwd)/alt_objects >.git/objects/info/alternates &&
> > +	echo content1 >file1 &&
> > +	# non-local loose object which is not present in bitmapped pack
> > +	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
> 
> Don't say "sha" when you mean "object name".  Otherwise you would
> end up introducing funky variable names like $objsha2 we see below
> that is confusing (we don't use SHA-2).

Ok, makes sense; I've changed objsha1 to altblob and objsha2 to blob2.
Thanks for the heads-up on this.


> > +	# non-local loose object which is also present in bitmapped pack
> > +	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
> > +	git add file1 &&
> > +	test_tick &&
> > +	git commit -m commit_file1 &&
> > +	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
> > +	git index-pack 1.pack &&
> > +	packobjects 1.idx >1.objects &&
> > +	printf "$objsha1\n$blob\n" >nonlocal-loose &&
> 
> I think Peff meant to suggest this instead:
> 
> 	printf "%s\n" "$objsha1" "$blob"

Oops, yes, my bad. Corrected.
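
A short sketch of why the "%s\n" form is the more robust one. The two values
below are hypothetical; real object names are hex and would not contain
format characters, so this only illustrates the general behaviour of printf.

```shell
# Hypothetical values; the second contains characters a printf format would interpret.
first=1234abcd
second='50%\n'

# The format is reused for every argument, and the arguments are never
# parsed as a format, so each value is printed verbatim on its own line.
out=$(printf "%s\n" "$first" "$second")
printf "%s\n" "$out"
```

With the values embedded in the format string instead, the "%" and "\n" in
the second value would be interpreted by printf rather than printed.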


> > +	if hasany nonlocal-loose 1.objects; then
> > +		echo "Non-local object present in pack generated with --local"
> > +		return 1
> > +	fi
> 
> Just saying
> 
> 	! has_any nonlocal-loose 1.objects
> 
> is sufficient.  Same comment for all other uses of these verbose
> output.
> 
> Besides, we spell "if/then/fi" like this:
> 
> 	if condition
>         then
>         	body
> 	fi
> 
> without a semicolon.

I initially copied these check templates from t7700-repack.sh, e.g. from
3289b9de (t7700: test that 'repack -a' packs alternate packed objects;
2008-11-13), and other places.

But ok, let's switch the checks to one-liners like "! has_any ...".

> > +test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
> > +...
> > +	touch .git/objects/pack/pack2-$pack2.keep &&
> 
> Please don't do "touch" _unless_ you care about the timestamp of the
> file.  Redirect an empty command into it, i.e.
> 
> 	>.git/objects/pack/pack2-$pack2.keep
> 
> or
> 
> 	echo "reason to keep it" >.git/objects/pack/pack2-$pack2.keep
> 
> instead.

Ok, I've changed it to >file, as the reason here is obvious.
Would you please explain why we should not use touch if we do not care
about timestamps? Is it simply style?
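
This does not answer the style question; it is only a sketch of the
observable difference between the two forms, using hypothetical file names in
a scratch directory.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Both forms create an empty file when it does not exist yet.
>"$tmp/a.keep"
touch "$tmp/b.keep"

# On an existing file they differ: touch leaves the content alone and only
# updates the timestamps, while the redirection truncates the file.
echo "reason to keep it" >"$tmp/c.keep"
touch "$tmp/c.keep"
size_after_touch=$(wc -c <"$tmp/c.keep")
>"$tmp/c.keep"
size_after_redirect=$(wc -c <"$tmp/c.keep")

echo "after touch: $size_after_touch bytes, after redirection: $size_after_redirect bytes"
```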

> > +test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
> > +	ls .git/objects/pack/ | grep bitmap >output &&
> > +	test_line_count = 1 output &&
> > +	packbitmap=$(basename $(cat output) .bitmap) &&
> > +	packobjects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
> > +	touch .git/objects/pack/$packbitmap.keep &&
> > +	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
> > +	git index-pack 3a.pack &&
> > +	packobjects 3a.idx >3a.objects &&
> > +	if hasany packbitmap.objects 3a.objects; then
> > +		echo "Object from .keeped bitmapped pack present in pack generated with --honour-pack-keep"
> > +		return 1
> > +	fi &&
> > +	rm .git/objects/pack/$packbitmap.keep
> 
> Arrange this removal to happen even when any earlier step fails, so
> that later tests will not get affected by stray existence of this
> file, by using test_when_finished.  E.g.
> 
> 	list_packed_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
> 	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
> 	>.git/objects/pack/$packbitmap.keep" &&

Ok, I did not know about test_when_finished; thanks for pointing
it out. Adjusted here and in a similar place.
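
The idea of registering the cleanup before the step that needs it can be
sketched in plain shell. This is only an analogy built on `trap ... EXIT` in
a subshell, not how the test library implements test_when_finished, and the
.keep file name is hypothetical.

```shell
tmp=$(mktemp -d)
keep="$tmp/pack-1234.keep"	# hypothetical file name

status=0
(
	# Register the cleanup first, then create the file.
	trap 'rm -f "$keep"' EXIT
	>"$keep" &&
	false &&		# simulate a failing step in the middle
	echo "not reached"
) || status=$?

# The body failed, yet the stray file is gone.
if test -e "$keep"
then
	leftover=yes
else
	leftover=no
fi
echo "status=$status leftover=$leftover"
rm -rf "$tmp"
```

A trailing "rm" chained with && at the end of the body would have been
skipped here, which is the situation the review comment is about.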

Will send the v4 patch as a reply to this mail, with the interdiff below:

Thanks again,
Kirill

---- 8< ---- (interdiff)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4c129bd..c92d7fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -953,16 +953,14 @@ static int want_found_object(int exclude, struct packed_git *p)
 		return 0;
 
 	/*
-	 * When asked to do --local (do not include an
-	 * object that appears in a pack we borrow
-	 * from elsewhere) or --honor-pack-keep (do not
-	 * include an object that appears in a pack marked
-	 * with .keep), we need to make sure no copy of this
-	 * object come from in _any_ pack that causes us to
-	 * omit it, and need to complete this loop.  When
-	 * neither option is in effect, we know the object
-	 * we just found is going to be packed, so break
-	 * out of the search loop now.
+	 * When asked to do --local (do not include an object that appears in a
+	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
+	 * an object that appears in a pack marked with .keep), finding a pack
+	 * that matches the criteria is sufficient for us to decide to omit it.
+	 * However, even if this pack does not satisfy the criteria, we need to
+	 * make sure no copy of this object appears in _any_ pack that makes us
+	 * to omit the object, so we need to check all the packs. Signal that by
+	 * returning -1 to the caller.
 	 */
 	if (!ignore_packed_keep &&
 	    (!local || !have_non_local_packs))
@@ -981,9 +979,10 @@ static int want_found_object(int exclude, struct packed_git *p)
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, if object's pack entry was not already found,
- * we will find the packed version of this object, if any. We therefore pass
- * out the pack information to avoid having to look it up again later.
+ * If the caller already knows an existing pack it wants to take the object
+ * from, that is passed in *found_pack and *found_offset; otherwise this
+ * function finds if there is any pack that has the object and returns the pack
+ * and its offset in these variables.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index cce95d8..44914ac 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -8,16 +8,15 @@ objpath () {
 }
 
 # show objects present in pack ($1 should be associated *.idx)
-packobjects () {
-	git show-index <$1 | cut -d' ' -f2
+pack_list_objects () {
+	git show-index <"$1" | cut -d' ' -f2
 }
 
-# hasany pattern-file content-file
+# has_any pattern-file content-file
 # tests whether content-file has any entry from pattern-file with entries being
 # whole lines.
-hasany () {
-	# NOTE `grep -f` is not portable
-	git grep --no-index -qFf $1 $2
+has_any () {
+	grep -qFf "$1" "$2"
 }
 
 test_expect_success 'setup repo with moderate-sized history' '
@@ -133,83 +132,68 @@ test_expect_success 'incremental repack can disable bitmaps' '
 '
 
 test_expect_success 'pack-objects respects --local (non-local loose)' '
-	mkdir -p alt_objects/pack &&
-	echo $(pwd)/alt_objects >.git/objects/info/alternates &&
+	git init --bare alt.git &&
+	echo $(pwd)/alt.git/objects >.git/objects/info/alternates &&
 	echo content1 >file1 &&
 	# non-local loose object which is not present in bitmapped pack
-	objsha1=$(GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w file1) &&
+	altblob=$(GIT_DIR=alt.git git hash-object -w file1) &&
 	# non-local loose object which is also present in bitmapped pack
-	git cat-file blob $blob | GIT_OBJECT_DIRECTORY=alt_objects git hash-object -w --stdin &&
+	git cat-file blob $blob | GIT_DIR=alt.git git hash-object -w --stdin &&
 	git add file1 &&
 	test_tick &&
 	git commit -m commit_file1 &&
 	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
 	git index-pack 1.pack &&
-	packobjects 1.idx >1.objects &&
-	printf "$objsha1\n$blob\n" >nonlocal-loose &&
-	if hasany nonlocal-loose 1.objects; then
-		echo "Non-local object present in pack generated with --local"
-		return 1
-	fi
+	pack_list_objects 1.idx >1.objects &&
+	printf "%s\n" "$altblob" "$blob" >nonlocal-loose &&
+	! has_any nonlocal-loose 1.objects
 '
 
 test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
 	echo content2 >file2 &&
-	objsha2=$(git hash-object -w file2) &&
+	blob2=$(git hash-object -w file2) &&
 	git add file2 &&
 	test_tick &&
 	git commit -m commit_file2 &&
-	printf "$objsha2\n$bitmaptip\n" >keepobjects &&
+	printf "%s\n" "$blob2" "$bitmaptip" >keepobjects &&
 	pack2=$(git pack-objects pack2 <keepobjects) &&
 	mv pack2-$pack2.* .git/objects/pack/ &&
-	touch .git/objects/pack/pack2-$pack2.keep &&
-	rm $(objpath $objsha2) &&
+	>.git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $blob2) &&
 	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
 	git index-pack 2a.pack &&
-	packobjects 2a.idx >2a.objects &&
-	if hasany keepobjects 2a.objects; then
-		echo "Object from .keeped pack present in pack generated with --honor-pack-keep"
-		return 1
-	fi
+	pack_list_objects 2a.idx >2a.objects &&
+	! has_any keepobjects 2a.objects
 '
 
 test_expect_success 'pack-objects respects --local (non-local pack)' '
-	mv .git/objects/pack/pack2-$pack2.* alt_objects/pack/ &&
+	mv .git/objects/pack/pack2-$pack2.* alt.git/objects/pack/ &&
 	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
 	git index-pack 2b.pack &&
-	packobjects 2b.idx >2b.objects &&
-	if hasany keepobjects 2b.objects; then
-		echo "Non-local object present in pack generated with --local"
-		return 1
-	fi
+	pack_list_objects 2b.idx >2b.objects &&
+	! has_any keepobjects 2b.objects
 '
 
 test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
 	ls .git/objects/pack/ | grep bitmap >output &&
 	test_line_count = 1 output &&
 	packbitmap=$(basename $(cat output) .bitmap) &&
-	packobjects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
-	touch .git/objects/pack/$packbitmap.keep &&
+	pack_list_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
+	>.git/objects/pack/$packbitmap.keep &&
 	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
 	git index-pack 3a.pack &&
-	packobjects 3a.idx >3a.objects &&
-	if hasany packbitmap.objects 3a.objects; then
-		echo "Object from .keeped bitmapped pack present in pack generated with --honour-pack-keep"
-		return 1
-	fi &&
-	rm .git/objects/pack/$packbitmap.keep
+	pack_list_objects 3a.idx >3a.objects &&
+	! has_any packbitmap.objects 3a.objects
 '
 
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
-	mv .git/objects/pack/$packbitmap.* alt_objects/pack/ &&
+	mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+	test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" &&
 	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
 	git index-pack 3b.pack &&
-	packobjects 3b.idx >3b.objects &&
-	if hasany packbitmap.objects 3b.objects; then
-		echo "Non-local object from bitmapped pack present in pack generated with --local"
-		return 1
-	fi &&
-	mv alt_objects/pack/$packbitmap.* .git/objects/pack/
+	pack_list_objects 3b.idx >3b.objects &&
+	! has_any packbitmap.objects 3b.objects
 '
 
 test_expect_success 'pack-objects to file can use bitmap' '
@@ -256,14 +240,11 @@ test_expect_success 'pack-objects respects --incremental' '
 	EOF
 	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
 	git index-pack 4.pack &&
-	packobjects 4.idx >4.objects &&
+	pack_list_objects 4.idx >4.objects &&
 	test_line_count = 4 4.objects &&
 	git rev-list --objects $commit >revlist &&
 	cut -d" " -f1 revlist |sort >objects &&
-	if !hasany objects 4.objects; then
-		echo "Expected objects not present in incremental pack"
-		return 1
-	fi
+	test_cmp 4.objects objects
 '
 
 test_expect_success 'pack with missing blob' '

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5] pack-objects: teach it to use reachability bitmap index when generating non-stdout pack too
  2016-08-08 20:53                             ` Junio C Hamano
@ 2016-08-09 11:21                               ` Kirill Smelkov
  2016-08-09 11:26                                 ` [PATCH 2/2 v6] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 11:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Aug 08, 2016 at 01:53:20PM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > diff --git a/Documentation/config.txt b/Documentation/config.txt
> > index bc1c433..4ba0c4a 100644
> > --- a/Documentation/config.txt
> > +++ b/Documentation/config.txt
> > @@ -2244,6 +2244,9 @@ pack.useBitmaps::
> >  	to stdout (e.g., during the server side of a fetch). Defaults to
> >  	true. You should not generally need to turn this off unless
> >  	you are debugging pack bitmaps.
> > ++
> > +*NOTE*: when packing to file (e.g., on repack) the default is always not to use
> > +	pack bitmaps.
> 
> This is a bit hard to read and understand.
> 
> The patched result starts with "When true, git will use bitmap when
> packing to stdout", i.e. when packing to file, git will not.  So
> this *NOTE* is repeating the same thing.  The reader is made to
> wonder "Why does it need to repeat the same thing?  Does this mean
> when the variable is set, a pack sent to a disk uses the bitmap?"
> 
> I think what you actually do in the code is to make the variable
> affect _only_ the standard-output case, and users need a command
> line option if they want to use bitmap when writing to a file (the
> code to do so looks correctly done).

Yes, that is how it is programmed. But I added the note because
to me it is only implicit that "When true, git will use bitmap when
packing to stdout" means 1) the default for packing-to-file is different
and 2) there is no way to set the default for packing-to-file. That's
why I added the explicit info.

And especially since the config name "pack.useBitmaps" does not contain
"stdout" at all, it can be very confusing to people looking at this for the
first time (at least it was for me). Also please recall that you
wondered why 6b8fda2d added bitmap support only for the to-stdout case
without even mentioning why it was done only for that case and not for
the to-file case.

I do not insist on the note, however - I only thought it is better to
have it - so if you prefer we go without it, let us drop this note.

Will send v6 as reply to this mail with below interdiff.

Thanks,
Kirill

---- 8< ---- (interdiff)

--- b/Documentation/config.txt
+++ a/Documentation/config.txt
@@ -2246,9 +2246,6 @@
        to stdout (e.g., during the server side of a fetch). Defaults to
        true. You should not generally need to turn this off unless
        you are debugging pack bitmaps.
-+
-*NOTE*: when packing to file (e.g., on repack) the default is always not to use
-       pack bitmaps.
 
 pack.writeBitmaps (deprecated)::
        This is a deprecated synonym for `repack.writeBitmaps`.
diff -u b/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
--- b/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -219,8 +219,8 @@
        # verify equivalent packs are generated with/without using bitmap index
        packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
        packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
-       git show-index <packa-$packasha1.idx | cut -d" " -f2 >packa.objects &&
-       git show-index <packb-$packbsha1.idx | cut -d" " -f2 >packb.objects &&
+       pack_list_objects <packa-$packasha1.idx >packa.objects &&
+       pack_list_objects <packb-$packbsha1.idx >packb.objects &&
        test_cmp packa.objects packb.objects
 '

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/2 v4] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-08-09 11:21                                 ` Kirill Smelkov
@ 2016-08-09 11:25                                   ` Kirill Smelkov
  2016-08-09 16:52                                   ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
  1 sibling, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 11:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check at all whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath simply has no such checking.

The bitmapped codepath nevertheless accepted all those options, and
bitmap indices were still used under such conditions - potentially
giving wrong output (e.g. including objects from a non-local or
.keep'ed pack).

We can easily fix this by noting the following: an object can come to
add_object_entry_from_bitmap() for two reasons:

    1. entries coming from main pack covered by bitmap index, and
    2. object coming from, possibly alternate, loose or other packs.

"2" can be already handled by want_object_in_pack() and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning the caller already found the object's pack entry.

In want_object_in_pack() we take care to start the checks from the already
found pack, if we have one, this way determining the answer right away
in case neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows, we do not do harm to
served-with-bitmap clones performance-wise:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
    5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
    5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
    5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%

with all differences strangely showing we are a bit faster now, but
probably all being within noise.

And in the general case we take care not to make duplicate
find_pack_entry_one(*found_pack) calls. The worst that can happen is that we
call want_found_object(*found_pack) -- the newly introduced helper for
checking whether we want the object -- twice, but since want_found_object()
is very lightweight it does not make any difference.

I appreciate the help of Junio C Hamano and Jeff King in discussing
this change.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  | 93 +++++++++++++++++++++++++++++++------------------
 t/t5310-pack-bitmaps.sh | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..b1007f2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -944,13 +944,44 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
+static int want_found_object(int exclude, struct packed_git *p)
+{
+	if (exclude)
+		return 1;
+	if (incremental)
+		return 0;
+
+	/*
+	 * When asked to do --local (do not include an object that appears in a
+	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
+	 * an object that appears in a pack marked with .keep), finding a pack
+	 * that matches the criteria is sufficient for us to decide to omit it.
+	 * However, even if this pack does not satisfy the criteria, we need to
+	 * make sure no copy of this object appears in _any_ pack that makes us
+	 * to omit the object, so we need to check all the packs. Signal that by
+	 * returning -1 to the caller.
+	 */
+	if (!ignore_packed_keep &&
+	    (!local || !have_non_local_packs))
+		return 1;
+
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	/* we don't know yet; keep looking for more packs */
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * If the caller already knows an existing pack it wants to take the object
+ * from, that is passed in *found_pack and *found_offset; otherwise this
+ * function finds if there is any pack that has the object and returns the pack
+ * and its offset in these variables.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +989,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	int want;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present we will determine the answer right now.
+	 */
+	if (*found_pack) {
+		want = want_found_object(exclude, *found_pack);
+		if (want != -1)
+			return want;
+	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset;
+
+		if (p == *found_pack)
+			offset = *found_offset;
+		else
+			offset = find_pack_entry_one(sha1, p);
+
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -974,31 +1020,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			if (exclude)
-				return 1;
-			if (incremental)
-				return 0;
-
-			/*
-			 * When asked to do --local (do not include an
-			 * object that appears in a pack we borrow
-			 * from elsewhere) or --honor-pack-keep (do not
-			 * include an object that appears in a pack marked
-			 * with .keep), we need to make sure no copy of this
-			 * object come from in _any_ pack that causes us to
-			 * omit it, and need to complete this loop.  When
-			 * neither option is in effect, we know the object
-			 * we just found is going to be packed, so break
-			 * out of the loop to return 1 now.
-			 */
-			if (!ignore_packed_keep &&
-			    (!local || !have_non_local_packs))
-				break;
-
-			if (local && !p->pack_local)
-				return 0;
-			if (ignore_packed_keep && p->pack_local && p->pack_keep)
-				return 0;
+			want = want_found_object(exclude, p);
+			if (want != -1)
+				return want;
 		}
 	}
 
@@ -1039,8 +1063,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1097,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..a50d867 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -7,6 +7,18 @@ objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
 
+# show objects present in pack ($1 should be associated *.idx)
+pack_list_objects () {
+	git show-index <"$1" | cut -d' ' -f2
+}
+
+# has_any pattern-file content-file
+# tests whether content-file has any entry from pattern-file with entries being
+# whole lines.
+has_any () {
+	grep -qFf "$1" "$2"
+}
+
 test_expect_success 'setup repo with moderate-sized history' '
 	for i in $(test_seq 1 10); do
 		test_commit $i
@@ -16,6 +28,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git rev-parse master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +131,71 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	git init --bare alt.git &&
+	echo $(pwd)/alt.git/objects >.git/objects/info/alternates &&
+	echo content1 >file1 &&
+	# non-local loose object which is not present in bitmapped pack
+	altblob=$(GIT_DIR=alt.git git hash-object -w file1) &&
+	# non-local loose object which is also present in bitmapped pack
+	git cat-file blob $blob | GIT_DIR=alt.git git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	pack_list_objects 1.idx >1.objects &&
+	printf "%s\n" "$altblob" "$blob" >nonlocal-loose &&
+	! has_any nonlocal-loose 1.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 >file2 &&
+	blob2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	printf "%s\n" "$blob2" "$bitmaptip" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	>.git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $blob2) &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	pack_list_objects 2a.idx >2a.objects &&
+	! has_any keepobjects 2a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt.git/objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	pack_list_objects 2b.idx >2b.objects &&
+	! has_any keepobjects 2b.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	pack_list_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
+	>.git/objects/pack/$packbitmap.keep &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	pack_list_objects 3a.idx >3a.objects &&
+	! has_any packbitmap.objects 3a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+	test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" &&
+	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	pack_list_objects 3b.idx >3b.objects &&
+	! has_any packbitmap.objects 3b.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +221,20 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	pack_list_objects 4.idx >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	cut -d" " -f1 revlist |sort >objects &&
+	test_cmp 4.objects objects
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 2/2 v6] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-08-09 11:21                               ` Kirill Smelkov
@ 2016-08-09 11:26                                 ` Kirill Smelkov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 11:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That, however, was done only for
the case when the resultant pack is sent to stdout, not written into a file.

The reason is that for on-disk repack we want, by default:

- to produce a good pack (with bitmap index, not-yet-packed objects are
  emitted to the pack in suboptimal order).

- to use the more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we take care that it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

This way for pack-objects -> file we get nice speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs when restoring repositories. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects not to use the bitmap
index: tools which need to generate on-disk packs using the bitmap index can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so this way we can be sure regular on-disk repacking
remains robust.
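
To make the resulting precedence concrete, here is a rough shell model of
how the final setting is resolved. It is only an illustration: the function
name resolve_use_bitmap and its argument order are made up for this sketch,
and the real logic is the C code in cmd_pack_objects() shown in the diff.

```shell
#!/bin/sh
# Hypothetical model of the decision; not part of git.
# usage: resolve_use_bitmap <cmdline> <config> <to_stdout> \
#                           <write_bitmap> <internal_revs> <shallow>
#   cmdline: -1 (option not given), 0 (--no-use-bitmap-index),
#            1 (--use-bitmap-index)
#   others:  0 or 1
resolve_use_bitmap () {
	cmdline=$1 config=$2 to_stdout=$3
	write_bitmap=$4 internal_revs=$5 shallow=$6

	# "soft" reason: packing to a file defaults to not using bitmaps
	default=$config
	if test "$to_stdout" -eq 0
	then
		default=0
	fi

	# an explicit command-line option overrides the default
	use=$cmdline
	if test "$use" -lt 0
	then
		use=$default
	fi

	# "hard" reasons: bitmaps cannot work at all in these cases
	if test "$internal_revs" -eq 0 ||
	   { test "$to_stdout" -eq 0 && test "$write_bitmap" -eq 1; } ||
	   test "$shallow" -eq 1
	then
		use=0
	fi

	echo "$use"
}

resolve_use_bitmap -1 1 1 0 1 0   # to stdout, option not given:     1
resolve_use_bitmap -1 1 0 0 1 0   # to file, option not given:       0
resolve_use_bitmap  1 1 0 0 1 0   # to file, --use-bitmap-index:     1
resolve_use_bitmap  1 1 0 1 1 0   # to file, also writing a bitmap:  0
```

The point of the separate "default" value is that pack.useBitmaps and the
to-file heuristic only matter when nothing was said on the command line.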

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://marc.info/?t=146792101400001&r=1&w=2

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  | 31 ++++++++++++++++++++++++-------
 t/t5310-pack-bitmaps.sh | 12 ++++++++++++
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b1007f2..c92d7fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -67,7 +67,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2270,7 +2271,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2519,13 +2520,13 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 }
 
 /*
- * This tracks any options which a reader of the pack might
- * not understand, and which would therefore prevent blind reuse
- * of what we have on disk.
+ * This tracks any options which pack-reuse code expects to be on, or which a
+ * reader of the pack might not understand, and which would therefore prevent
+ * blind reuse of what we have on disk.
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_ofs_delta;
+	return pack_to_stdout && allow_ofs_delta;
 }
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
@@ -2818,7 +2819,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a50d867..44914ac 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -196,6 +196,18 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	! has_any packbitmap.objects 3b.objects
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	pack_list_objects <packa-$packasha1.idx >packa.objects &&
+	pack_list_objects <packb-$packbsha1.idx >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-09 11:21                                 ` Kirill Smelkov
  2016-08-09 11:25                                   ` [PATCH 1/2 v4] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
@ 2016-08-09 16:52                                   ` Junio C Hamano
  2016-08-09 19:29                                     ` Kirill Smelkov
  1 sibling, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-08-09 16:52 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> Would you please explain why we should not use touch if we do not care
> about timestamps? Simply style?

To help readers.

"touch A" forces the readers to wonder "does the timestamp of A
matter, and if so in what way?" and "does any later test care what
is _in_ A, and if so in what way?"  Both of them are wasting their
time when there is no reason why "touch" should have been used.
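
As a small illustration of the alternative (a sketch with a made-up file
name, meant to be run in a scratch directory): a bare redirection creates
the file and makes it plain that only its existence and emptiness matter.

```shell
#!/bin/sh
# Hypothetical scratch setup; "marker.keep" is an illustrative name.
dir=$(mktemp -d) &&
cd "$dir" &&

# create an empty file; nothing here suggests that timestamps matter
>marker.keep &&

test -f marker.keep &&      # the file exists ...
! test -s marker.keep &&    # ... and it is empty
echo "created empty marker.keep"
```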

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index cce95d8..44914ac 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -8,16 +8,15 @@ objpath () {
>  }
>  
>  # show objects present in pack ($1 should be associated *.idx)
> -packobjects () {
> -	git show-index <$1 | cut -d' ' -f2
> +pack_list_objects () {
> +	git show-index <"$1" | cut -d' ' -f2
>  }

pack-list-objects still sounds as if you are packing "list objects",
though.  If you are listing packed objects (or objects in a pack),
list-packed-objects (or list-objects-in-pack) reads clearer and more
to the point, at least to me.

> -# hasany pattern-file content-file
> +# has_any pattern-file content-file
>  # tests whether content-file has any entry from pattern-file with entries being
>  # whole lines.
> -hasany () {
> -	# NOTE `grep -f` is not portable
> -	git grep --no-index -qFf $1 $2
> +has_any () {
> +	grep -qFf "$1" "$2"

Omitting "-q" would help those who have to debug breakage in this
test or the code that this test checks.  What test_expect_success
outputs is not shown by default, and running the test script with
"-v" would show them as a debugging aid.

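For instance, with two small made-up input files, the variant without "-q"
prints the offending entries, which is what one would then see when running
the test script with "-v". This is only a sketch: the file names and
contents are invented, and has_any is written here without "-q".

```shell
#!/bin/sh
# has_any without -q (illustrative copy of the test helper)
has_any () {
	grep -Ff "$1" "$2"
}

dir=$(mktemp -d) && cd "$dir" || exit 1

# invented sample data: one entry of "patterns" also appears in "content"
printf "%s\n" aaa bbb >patterns
printf "%s\n" ccc bbb ddd >content

# prints the matching line "bbb" and succeeds
has_any patterns content

# with -q the exit status is the same, but nothing is printed
grep -qFf patterns content && echo "matched (quietly)"
```
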
Thanks.


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-09 16:52                                   ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
@ 2016-08-09 19:29                                     ` Kirill Smelkov
  2016-08-09 19:31                                       ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
                                                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 19:29 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Tue, Aug 09, 2016 at 09:52:18AM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > Would you please explain why we should not use touch if we do not care
> > about timestamps? Simply style?
> 
> To help readers.
> 
> "touch A" forces the readers to wonder "does the timestamp of A
> matter, and if so in what way?" and "does any later test care what
> is _in_ A, and if so in what way?"  Both of them are wasting their
> time when there is no reason why "touch" should have been used.

I see, thanks for explaining. I used to read it a bit the other way;
maybe it is just an environment difference.


> > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> > index cce95d8..44914ac 100755
> > --- a/t/t5310-pack-bitmaps.sh
> > +++ b/t/t5310-pack-bitmaps.sh
> > @@ -8,16 +8,15 @@ objpath () {
> >  }
> >  
> >  # show objects present in pack ($1 should be associated *.idx)
> > -packobjects () {
> > -	git show-index <$1 | cut -d' ' -f2
> > +pack_list_objects () {
> > +	git show-index <"$1" | cut -d' ' -f2
> >  }
> 
> pack-list-objects still sounds as if you are packing "list objects",
> though.  If you are listing packed objects (or objects in a pack),
> list-packed-objects (or list-objects-in-pack) reads clearer and more
> to the point, at least to me.

Ok, let it be list_packed_objects().


> > -# hasany pattern-file content-file
> > +# has_any pattern-file content-file
> >  # tests whether content-file has any entry from pattern-file with entries being
> >  # whole lines.
> > -hasany () {
> > -	# NOTE `grep -f` is not portable
> > -	git grep --no-index -qFf $1 $2
> > +has_any () {
> > +	grep -qFf "$1" "$2"
> 
> Omitting "-q" would help those who have to debug breakage in this
> test or the code that this test checks.  What test_expect_success
> outputs is not shown by default, and running the test script with
> "-v" would show them as a debugging aid.

Ok, makes sense. Both patches adjusted and will be reposted.

Thanks,
Kirill


* [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-08-09 19:29                                     ` Kirill Smelkov
@ 2016-08-09 19:31                                       ` Kirill Smelkov
  2016-08-18 17:52                                         ` Jeff King
  2016-08-09 19:32                                       ` [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
  2016-08-09 19:49                                       ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
  2 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 19:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check at all whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath simply has no such checking.

The bitmapped codepath nevertheless accepted all those options and kept
using bitmap indices under such conditions, potentially giving wrong
output (e.g. including objects from a non-local or .keep'ed pack).

We can easily fix this by noting the following: when an object comes to
add_object_entry_from_bitmap() it can come for two reasons:

    1. entries coming from main pack covered by bitmap index, and
    2. object coming from, possibly alternate, loose or other packs.

"2" can be already handled by want_object_in_pack() and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning calling client already found object's pack entry.

In want_object_in_pack() we take care to start the checks from the already
found pack, if we have one, this way determining the answer right away
in case neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows, we do not harm the
performance of clones served with bitmaps:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
    5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
    5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
    5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%

with all differences strangely showing we are a bit faster now, but
probably all being within noise.

And in the general case we take care not to have duplicate
find_pack_entry_one(*found_pack) calls. The worst that can happen is that we
call want_found_object(*found_pack) -- a newly introduced helper for
checking whether we want the object -- twice, but since want_found_object()
is very lightweight it does not make any difference.
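
The three-way answer of the new helper can be modeled roughly as follows.
This is a shell paraphrase for illustration only: the real helper is the C
function in the diff, and the positional-argument interface here is
invented for the sketch.

```shell
#!/bin/sh
# Hypothetical shell paraphrase of want_found_object(); prints
#   1 -> take the object, 0 -> omit it, -1 -> undecided, check other packs
# usage: want_found_object <exclude> <incremental> <local> \
#            <ignore_packed_keep> <have_non_local_packs> \
#            <pack_local> <pack_keep>
want_found_object () {
	exclude=$1 incremental=$2 local_only=$3 ignore_keep=$4
	have_non_local=$5 pack_local=$6 pack_keep=$7

	if test "$exclude" -eq 1; then echo 1; return; fi
	if test "$incremental" -eq 1; then echo 0; return; fi

	# neither --honor-pack-keep nor an effective --local: decided at once
	if test "$ignore_keep" -eq 0 &&
	   { test "$local_only" -eq 0 || test "$have_non_local" -eq 0; }
	then
		echo 1; return
	fi

	# --local and the pack is borrowed from elsewhere: omit
	if test "$local_only" -eq 1 && test "$pack_local" -eq 0
	then
		echo 0; return
	fi
	# --honor-pack-keep and the pack is a local .keep pack: omit
	if test "$ignore_keep" -eq 1 && test "$pack_local" -eq 1 &&
	   test "$pack_keep" -eq 1
	then
		echo 0; return
	fi

	# this pack alone does not settle it
	printf "%s\n" -1
}

want_found_object 0 0 0 0 0 1 0   # common case, no options:        1
want_found_object 0 0 1 0 1 0 0   # --local, non-local pack:        0
want_found_object 0 0 0 1 0 1 1   # --honor-pack-keep, .keep pack:  0
want_found_object 0 0 0 1 0 1 0   # --honor-pack-keep, plain pack: -1
```

Only the last case requires looking at the remaining packs, which is why
starting from an already known pack is cheap in the common case.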

I appreciate the help of Junio C Hamano and Jeff King in discussing this
change.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  | 93 +++++++++++++++++++++++++++++++------------------
 t/t5310-pack-bitmaps.sh | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..b1007f2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -944,13 +944,44 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
+static int want_found_object(int exclude, struct packed_git *p)
+{
+	if (exclude)
+		return 1;
+	if (incremental)
+		return 0;
+
+	/*
+	 * When asked to do --local (do not include an object that appears in a
+	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
+	 * an object that appears in a pack marked with .keep), finding a pack
+	 * that matches the criteria is sufficient for us to decide to omit it.
+	 * However, even if this pack does not satisfy the criteria, we need to
+	 * make sure no copy of this object appears in _any_ pack that makes us
+	 * to omit the object, so we need to check all the packs. Signal that by
+	 * returning -1 to the caller.
+	 */
+	if (!ignore_packed_keep &&
+	    (!local || !have_non_local_packs))
+		return 1;
+
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	/* we don't know yet; keep looking for more packs */
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * If the caller already knows an existing pack it wants to take the object
+ * from, that is passed in *found_pack and *found_offset; otherwise this
+ * function finds if there is any pack that has the object and returns the pack
+ * and its offset in these variables.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +989,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	int want;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present we will determine the answer right now.
+	 */
+	if (*found_pack) {
+		want = want_found_object(exclude, *found_pack);
+		if (want != -1)
+			return want;
+	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset;
+
+		if (p == *found_pack)
+			offset = *found_offset;
+		else
+			offset = find_pack_entry_one(sha1, p);
+
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -974,31 +1020,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			if (exclude)
-				return 1;
-			if (incremental)
-				return 0;
-
-			/*
-			 * When asked to do --local (do not include an
-			 * object that appears in a pack we borrow
-			 * from elsewhere) or --honor-pack-keep (do not
-			 * include an object that appears in a pack marked
-			 * with .keep), we need to make sure no copy of this
-			 * object come from in _any_ pack that causes us to
-			 * omit it, and need to complete this loop.  When
-			 * neither option is in effect, we know the object
-			 * we just found is going to be packed, so break
-			 * out of the loop to return 1 now.
-			 */
-			if (!ignore_packed_keep &&
-			    (!local || !have_non_local_packs))
-				break;
-
-			if (local && !p->pack_local)
-				return 0;
-			if (ignore_packed_keep && p->pack_local && p->pack_keep)
-				return 0;
+			want = want_found_object(exclude, p);
+			if (want != -1)
+				return want;
 		}
 	}
 
@@ -1039,8 +1063,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1097,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..a278d30 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -7,6 +7,18 @@ objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
 
+# show objects present in pack ($1 should be associated *.idx)
+list_packed_objects () {
+	git show-index <"$1" | cut -d' ' -f2
+}
+
+# has_any pattern-file content-file
+# tests whether content-file has any entry from pattern-file with entries being
+# whole lines.
+has_any () {
+	grep -Ff "$1" "$2"
+}
+
 test_expect_success 'setup repo with moderate-sized history' '
 	for i in $(test_seq 1 10); do
 		test_commit $i
@@ -16,6 +28,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git rev-parse master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +131,71 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	git init --bare alt.git &&
+	echo $(pwd)/alt.git/objects >.git/objects/info/alternates &&
+	echo content1 >file1 &&
+	# non-local loose object which is not present in bitmapped pack
+	altblob=$(GIT_DIR=alt.git git hash-object -w file1) &&
+	# non-local loose object which is also present in bitmapped pack
+	git cat-file blob $blob | GIT_DIR=alt.git git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	list_packed_objects 1.idx >1.objects &&
+	printf "%s\n" "$altblob" "$blob" >nonlocal-loose &&
+	! has_any nonlocal-loose 1.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 >file2 &&
+	blob2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	printf "%s\n" "$blob2" "$bitmaptip" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	>.git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $blob2) &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	list_packed_objects 2a.idx >2a.objects &&
+	! has_any keepobjects 2a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt.git/objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	list_packed_objects 2b.idx >2b.objects &&
+	! has_any keepobjects 2b.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	list_packed_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
+	>.git/objects/pack/$packbitmap.keep &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	list_packed_objects 3a.idx >3a.objects &&
+	! has_any packbitmap.objects 3a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+	test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" &&
+	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	list_packed_objects 3b.idx >3b.objects &&
+	! has_any packbitmap.objects 3b.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +221,20 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	list_packed_objects 4.idx >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	cut -d" " -f1 revlist |sort >objects &&
+	test_cmp 4.objects objects
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty


* [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-08-09 19:29                                     ` Kirill Smelkov
  2016-08-09 19:31                                       ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
@ 2016-08-09 19:32                                       ` Kirill Smelkov
  2016-08-18 18:06                                         ` Jeff King
  2016-08-09 19:49                                       ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
  2 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-08-09 19:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

The reason here is for on-disk repack by default we want:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we take care that it is not activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

This way for pack-objects -> file we get nice speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.
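
For reference, the percentages follow directly from the timings above. A
quick check with awk (the helper name speedup is made up for this sketch):

```shell
#!/bin/sh
# speedup <before> <after>: relative time saved, in percent
speedup () {
	awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}

speedup 37.2 26.2   # erp5.git: prints 29.6%
speedup  7.1  3.6   # git.git:  prints 49.3%
```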

git-backup extracts many packs when restoring repositories. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects not to use the bitmap
index: tools which need to generate on-disk packs using the bitmap index can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so this way we can be sure regular on-disk repacking
remains robust.

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

More context:

    http://marc.info/?t=146792101400001&r=1&w=2

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 builtin/pack-objects.c  | 31 ++++++++++++++++++++++++-------
 t/t5310-pack-bitmaps.sh | 12 ++++++++++++
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b1007f2..c92d7fc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -67,7 +67,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2270,7 +2271,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2519,13 +2520,13 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 }
 
 /*
- * This tracks any options which a reader of the pack might
- * not understand, and which would therefore prevent blind reuse
- * of what we have on disk.
+ * This tracks any options which pack-reuse code expects to be on, or which a
+ * reader of the pack might not understand, and which would therefore prevent
+ * blind reuse of what we have on disk.
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_ofs_delta;
+	return pack_to_stdout && allow_ofs_delta;
 }
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
@@ -2818,7 +2819,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a278d30..9602e9a 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -196,6 +196,18 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	! has_any packbitmap.objects 3b.objects
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	list_packed_objects <packa-$packasha1.idx >packa.objects &&
+	list_packed_objects <packb-$packbsha1.idx >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental
  2016-08-09 19:29                                     ` Kirill Smelkov
  2016-08-09 19:31                                       ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
  2016-08-09 19:32                                       ` [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
@ 2016-08-09 19:49                                       ` Junio C Hamano
  2 siblings, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2016-08-09 19:49 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> On Tue, Aug 09, 2016 at 09:52:18AM -0700, Junio C Hamano wrote:
>> "touch A" forces the readers to wonder "does the timestamp of A
>> matter, and if so in what way?" and "does any later test care what
>> is _in_ A, and if so in what way?"  Both of them are a waste of their
>> time when there is no reason why "touch" should have been used.
>
> I see, thanks for explaining. I used to read it a bit the other way;

Surely ">A" may invite "Hmm, is it important that A gets empty?", so
the choice between the two is not so black-and-white.  It just is
that "touch" has a more specific "update the timestamp while keeping
its contents intact" meaning, compared to ">A", which _could_ be
read as "make it empty and update its mtime" but most people would
not (i.e. "update its mtime" is a side effect for any modification).

> Ok, makes sense. Both patches adjusted and will be reposted.

Thanks.


* Re: [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-08-09 19:31                                       ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
@ 2016-08-18 17:52                                         ` Jeff King
  2016-09-10 14:57                                           ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-08-18 17:52 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Tue, Aug 09, 2016 at 10:31:43PM +0300, Kirill Smelkov wrote:

> Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> are two codepaths in pack-objects: with & without using bitmap
> reachability index.

Sorry, I got distracted from reviewing these patches. I'll give them a
detailed look now and hopefully we can finalize the topic.

> In want_object_in_pack() we care to start the checks from already found
> pack, if we have one, this way determining the answer right away
> in case neither --local nor --honor-pack-keep are active. In
> particular, as p5310-pack-bitmaps.sh shows, we do not do harm to
> served-with-bitmap clones performance-wise:
> 
>     Test                      56dfeb62          this tree
>     -----------------------------------------------------------------
>     5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
>     5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
>     5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
>     5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%
> 
> with all differences strangely showing we are a bit faster now, but
> probably all being within noise.

Good to know there is no regression. It is curious that there is a
slight _improvement_ across the board. Do we have an explanation for
that? It seems odd that noise would be so consistent.

> And in the general case we care not to have duplicate
> find_pack_entry_one(*found_pack) calls. Worst what can happen is we can
> call want_found_object(*found_pack) -- newly introduced helper for
> checking whether we want object -- twice, but since want_found_object()
> is very lightweight it does not make any difference.

I had trouble parsing this. I think maybe:

  In the general case we do not want to call find_pack_entry_one() more
  than once, because it is expensive. This patch splits the loop in
  want_object_in_pack() into two parts: finding the object and seeing if
  it impacts our choice to include it in the pack. We may call the
  inexpensive want_found_object() twice, but we will never call
  find_pack_entry_one() if we do not need to.
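
For readers following along, the "look up at most once" idea can be shown in a small standalone sketch. This is not the code from the patch: struct toy_pack, toy_find_entry(), the forces_omit flag and the options_matter parameter are all hypothetical simplifications (a toy pack holds a single object, and one flag stands in for the --local / --honor-pack-keep checks). It only illustrates that a caller-supplied pack/offset hint is reused instead of being searched for again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified pack: it holds at most one object. */
struct toy_pack {
	int object_id;         /* id of the object stored in this pack */
	long offset;           /* offset of that object (nonzero) */
	int forces_omit;       /* stands in for "non-local" or "has .keep" */
	struct toy_pack *next;
};

static int expensive_lookups;  /* counts calls to the costly search */

/* Stand-in for find_pack_entry_one(): returns 0 when not found. */
static long toy_find_entry(int id, const struct toy_pack *p)
{
	expensive_lookups++;
	return p->object_id == id ? p->offset : 0;
}

/*
 * Returns 1 if the object should go into the new pack, 0 otherwise.
 * A non-NULL *found_pack is trusted as a hint and never searched again.
 */
static int toy_want_object(int id, struct toy_pack *packs, int options_matter,
			   struct toy_pack **found_pack, long *found_offset)
{
	struct toy_pack *p;

	/* Quick check on the hinted pack before walking all packs. */
	if (*found_pack) {
		if (!options_matter)
			return 1;
		if ((*found_pack)->forces_omit)
			return 0;
	}

	for (p = packs; p; p = p->next) {
		long offset;

		if (p == *found_pack)
			offset = *found_offset;  /* reuse, no second search */
		else
			offset = toy_find_entry(id, p);
		if (!offset)
			continue;
		if (!*found_pack) {
			*found_pack = p;
			*found_offset = offset;
		}
		if (!options_matter)
			return 1;
		if (p->forces_omit)
			return 0;
	}
	return 1;  /* no pack vetoed the object */
}
```

With a hint and no restricting options the function returns without any search at all; with restricting options it still has to walk the other packs, but the hinted pack itself is never searched a second time.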

> +static int want_found_object(int exclude, struct packed_git *p)
> +{
> +	if (exclude)
> +		return 1;
> +	if (incremental)
> +		return 0;
> +
> +	/*
> +	 * When asked to do --local (do not include an object that appears in a
> +	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
> +	 * an object that appears in a pack marked with .keep), finding a pack
> +	 * that matches the criteria is sufficient for us to decide to omit it.
> +	 * However, even if this pack does not satisfy the criteria, we need to
> +	 * make sure no copy of this object appears in _any_ pack that makes us
> +	 * to omit the object, so we need to check all the packs. Signal that by
> +	 * returning -1 to the caller.
> +	 */
> +	if (!ignore_packed_keep &&
> +	    (!local || !have_non_local_packs))
> +		return 1;

Hmm. The comment says "-1", but the return says "1". That is because the
comment is describing the return that happens at the end. :)

I wonder if the last sentence should be:

  We can check here whether these options can possibly matter; if not,
  we can return early from the function here. Otherwise, we signal "-1"
  at the end to tell the caller that we do not know either way, and it
  needs to check more packs.
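
To make the three possible answers concrete, here is a minimal standalone sketch of such a tri-state helper. The first three checks mirror the hunk quoted above; the two per-pack checks at the end are my assumption about what follows in the function, since that part is not quoted here. struct pack_opts and struct pack_info are hypothetical stand-ins for the option globals and for struct packed_git.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for pack-objects' option globals. */
struct pack_opts {
	bool incremental;          /* --incremental */
	bool local;                /* --local */
	bool ignore_packed_keep;   /* --honor-pack-keep */
	bool have_non_local_packs; /* some pack is borrowed from elsewhere */
};

/* Hypothetical stand-in for the pack properties that get consulted. */
struct pack_info {
	bool pack_local; /* pack lives in our own object directory */
	bool pack_keep;  /* pack is marked with .keep */
};

/*
 * Returns 1 (want), 0 (do not want) or -1 (cannot tell from this pack
 * alone; the caller has to look at the remaining packs).
 */
static int want_found_object_sketch(const struct pack_opts *o, bool exclude,
				    const struct pack_info *p)
{
	if (exclude)
		return 1;
	if (o->incremental)
		return 0;

	/* Neither option can possibly matter: decide right away. */
	if (!o->ignore_packed_keep &&
	    (!o->local || !o->have_non_local_packs))
		return 1;

	/* Assumed tail: this pack alone is enough to omit the object. */
	if (o->local && !p->pack_local)
		return 0;
	if (o->ignore_packed_keep && p->pack_local && p->pack_keep)
		return 0;

	/* Undecided: another pack may still force us to omit it. */
	return -1;
}
```

The point of the -1 is that "this pack does not veto the object" is weaker than "no pack vetoes the object", so only the caller's loop over all packs can turn it into a final 1.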

> -	*found_pack = NULL;
> -	*found_offset = 0;
> +	/*
> +	 * If we already know the pack object lives in, start checks from that
> +	 * pack - in the usual case when neither --local was given nor .keep files
> +	 * are present we will determine the answer right now.
> +	 */
> +	if (*found_pack) {
> +		want = want_found_object(exclude, *found_pack);
> +		if (want != -1)
> +			return want;
> +	}

Looks correct. Though it is not really "start checks from..." anymore,
but rather "do a quick check to see if we can quit early, and otherwise
start the loop". That might be nitpicking, though.

>  	for (p = packed_git; p; p = p->next) {
> -		off_t offset = find_pack_entry_one(sha1, p);
> +		off_t offset;
> +
> +		if (p == *found_pack)
> +			offset = *found_offset;
> +		else
> +			offset = find_pack_entry_one(sha1, p);
> +

This hunk will conflict with the MRU optimizations in 'next', but I
think the resolution should be pretty trivial.

>  static int add_object_entry(const unsigned char *sha1, enum object_type type,
>  			    const char *name, int exclude)
>  {
> -	struct packed_git *found_pack;
> -	off_t found_offset;
> +	struct packed_git *found_pack = NULL;
> +	off_t found_offset = 0;

I think technically we don't need to initialize found_offset here (it is
considered only if *found_pack is not NULL), but it doesn't hurt to make
our starting assumptions clear.

> @@ -1073,6 +1097,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
>  	if (have_duplicate_entry(sha1, 0, &index_pos))
>  		return 0;
>  
> +	if (!want_object_in_pack(sha1, 0, &pack, &offset))
> +		return 0;
> +

And this caller doesn't need to worry about initialization, because of
course it knows it has a pack/offset already. Good.

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index 3893afd..a278d30 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh

Tests look OK. I saw a few style nitpicks, but I think they are not even
against our style guide but more "I would have written it like this" and
are not even worth quibbling over.

So I think the code here is fine, and I just had a few minor complaints
on comment and commit message clarity.

-Peff


* Re: [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-08-09 19:32                                       ` [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
@ 2016-08-18 18:06                                         ` Jeff King
  2016-09-10 14:59                                           ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff King @ 2016-08-18 18:06 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Tue, Aug 09, 2016 at 10:32:17PM +0300, Kirill Smelkov wrote:

> Subject: Re: [PATCH 2/2 v7] pack-objects: use reachability bitmap index when
>    generating non-stdout pack

This is v7, but as I understand your numbering, it goes with v5 of patch
1/2 that I just reviewed (usually we just increment the version number
on the whole series and treat it as a unit, even if some patches didn't
change from version to version).

> So we can teach pack-objects to use bitmap index for initial object
> counting phase when generating resultant pack file too:
> 
> - if we care it is not activated under git-repack:

Do you mean "if we take care that it is not..." here?

(I think you might just be getting tripped up in the English idioms;
"care" means that we have a preference; "to take care" means that we are
being careful).

> - if we know bitmap index generation is not enabled for resultant pack:
> 
>   Current code has singleton bitmap_git so cannot work simultaneously
>   with two bitmap indices.

Minor English fixes:

  The current code has a singleton bitmap_git, so it cannot work
  simultaneously with two bitmap indices.

> - if we keep pack reuse enabled still only for "send-to-stdout" case:
> 
>   Because on pack reuse raw entries are directly written out to destination
>   pack by write_reused_pack() bypassing needed for pack index generation
>   bookkeeping done by regular codepath in write_one() and friends.

Ditto on English:

  On pack reuse raw entries are directly written out to the destination
  pack by write_reused_pack(), bypassing the need for pack index
  generation bookkeeping done by the regular code path in write_one()
  and friends.

I think this is missing the implication. Why wouldn't we want to reuse
in this case? Certainly we don't when doing a "careful" on-disk repack.
I suspect the answer is that we cannot write a ".idx" off of the result
of write_reused_pack(), and write-to-disk always includes the .idx.

> More context:
> 
>     http://marc.info/?t=146792101400001&r=1&w=2

Can we turn this into a link to public-inbox? We have just been bit by
all of our old links to gmane dying, and they cannot easily be replaced
because they use a gmane-specific article number. public-inbox URLs use
message-ids, which should be usable for other archives if public-inbox
goes away.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index b1007f2..c92d7fc 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c

The code here looks fine.
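
For readers who do not have the hunk in front of them, the order in which the setting gets resolved can be sketched as a standalone function. This is only an illustration under my own naming: struct bitmap_inputs and both function names are hypothetical, and the real code works on file-scope variables in builtin/pack-objects.c rather than on a struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the inputs the cmd_pack_objects() hunk looks at. */
struct bitmap_inputs {
	int cmdline;          /* -1 unset, 0 --no-use-bitmap-index, 1 --use-bitmap-index */
	bool config_default;  /* pack.useBitmaps, true when not configured */
	bool pack_to_stdout;
	bool use_internal_rev_list;
	bool write_bitmap_index;
	bool repository_shallow;
};

static bool resolve_use_bitmap_index(const struct bitmap_inputs *in)
{
	bool def = in->config_default;
	int use;

	/* "soft" reason: on-disk packs default to the non-bitmap codepath */
	if (!in->pack_to_stdout)
		def = false;

	/* an explicit command-line choice overrides the default */
	use = in->cmdline < 0 ? def : in->cmdline;

	/* "hard" reasons: bitmaps just won't work at all here */
	if (!in->use_internal_rev_list ||
	    (!in->pack_to_stdout && in->write_bitmap_index) ||
	    in->repository_shallow)
		use = 0;

	return use != 0;
}

/* Pack reuse stays limited to the send-to-stdout case. */
static bool allow_reuse_sketch(bool pack_to_stdout, bool allow_ofs_delta)
{
	return pack_to_stdout && allow_ofs_delta;
}
```

So a plain to-file run resolves to "off", an explicit --use-bitmap-index turns it on, and the hard reasons win over both.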

> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index a278d30..9602e9a 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -196,6 +196,18 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
>  	! has_any packbitmap.objects 3b.objects
>  '
>  
> +test_expect_success 'pack-objects to file can use bitmap' '
> +	# make sure we still have 1 bitmap index from previous tests
> +	ls .git/objects/pack/ | grep bitmap >output &&
> +	test_line_count = 1 output &&
> +	# verify equivalent packs are generated with/without using bitmap index
> +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> +	list_packed_objects <packa-$packasha1.idx >packa.objects &&
> +	list_packed_objects <packb-$packbsha1.idx >packb.objects &&
> +	test_cmp packa.objects packb.objects
> +'

Of course we can't know if bitmaps were actually used, or if they were
turned off under the hood. But at least this exercises the code a bit.

You could possibly add a perf test which shows off the improvement, but
I don't think it's strictly necessary.

-Peff


* Re: [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-08-18 17:52                                         ` Jeff King
@ 2016-09-10 14:57                                           ` Kirill Smelkov
  2016-09-10 15:01                                             ` [PATCH 1/2 v8] " Kirill Smelkov
                                                               ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-10 14:57 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Thu, Aug 18, 2016 at 01:52:22PM -0400, Jeff King wrote:
> On Tue, Aug 09, 2016 at 10:31:43PM +0300, Kirill Smelkov wrote:
> 
> > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> > are two codepaths in pack-objects: with & without using bitmap
> > reachability index.
> 
> Sorry, I got distracted from reviewing these patches. I'll give them a
> detailed look now and hopefully we can finalize the topic.

Jeff, thanks for feedback. On my side I'm sorry for the delay because I
was travelling and only recently got back to work.

> > In want_object_in_pack() we care to start the checks from already found
> > pack, if we have one, this way determining the answer right away
> > in case neither --local nor --honor-pack-keep are active. In
> > particular, as p5310-pack-bitmaps.sh shows, we do not do harm to
> > served-with-bitmap clones performance-wise:
> > 
> >     Test                      56dfeb62          this tree
> >     -----------------------------------------------------------------
> >     5310.2: repack to disk    9.63(8.67+0.33)   9.47(8.55+0.28) -1.7%
> >     5310.3: simulated clone   2.07(2.17+0.12)   2.03(2.14+0.12) -1.9%
> >     5310.4: simulated fetch   0.78(1.03+0.02)   0.76(1.00+0.03) -2.6%
> >     5310.6: partial bitmap    1.97(2.43+0.15)   1.92(2.36+0.14) -2.5%
> > 
> > with all differences strangely showing we are a bit faster now, but
> > probably all being within noise.
> 
> Good to know there is no regression. It is curious that there is a
> slight _improvement_ across the board. Do we have an explanation for
> that? It seems odd that noise would be so consistent.

Yes, I wondered about that too. It turned out that t/perf/run does not
copy config.mak.autogen & friends to build/, and I'm using autoconf with
CFLAGS="-march=native -O3 ..."

Junio, I could not resist writing the following:

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends
 to build area

Otherwise for people who use autotools-based configure in main worktree,
the performance testing results will be inconsistent as work and build
trees could be using e.g. different optimization levels.

See e.g.

	http://public-inbox.org/git/20160818175222.bmm3ivjheokf2qzl@sigill.intra.peff.net/

NOTE config.status has to be copied too, because without it the build
would want to re-run configure, losing the just-copied config.mak.autogen.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 t/perf/run | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/perf/run b/t/perf/run
index cfd7012..aa383c2 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -30,7 +30,7 @@ unpack_git_rev () {
 }
 build_git_rev () {
 	rev=$1
-	cp ../../config.mak build/$rev/config.mak
+	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
 	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
 	die "failed to build revision '$mydir'"
 }
-- 
2.9.2.701.gf965a18.dirty
---- 8< ----

With corrected t/perf/run the timings are more realistic - e.g. 3
consecutive runs of `./run 56dfeb62 . ./p5310-pack-bitmaps.sh`:

Test                      56dfeb62          this tree
-----------------------------------------------------------------
5310.2: repack to disk    9.08(8.20+0.25)   9.09(8.14+0.32) +0.1%
5310.3: simulated clone   1.92(2.12+0.08)   1.93(2.12+0.09) +0.5%
5310.4: simulated fetch   0.82(1.07+0.04)   0.82(1.06+0.04) +0.0%
5310.6: partial bitmap    1.96(2.42+0.13)   1.95(2.40+0.15) -0.5%

Test                      56dfeb62          this tree
-----------------------------------------------------------------
5310.2: repack to disk    9.11(8.16+0.32)   9.11(8.19+0.28) +0.0%
5310.3: simulated clone   1.93(2.14+0.07)   1.92(2.11+0.10) -0.5%
5310.4: simulated fetch   0.82(1.06+0.04)   0.82(1.04+0.05) +0.0%
5310.6: partial bitmap    1.95(2.38+0.16)   1.94(2.39+0.14) -0.5%

Test                      56dfeb62          this tree
-----------------------------------------------------------------
5310.2: repack to disk    9.13(8.17+0.31)   9.07(8.13+0.28) -0.7%
5310.3: simulated clone   1.92(2.13+0.07)   1.91(2.12+0.06) -0.5%
5310.4: simulated fetch   0.82(1.08+0.03)   0.82(1.08+0.03) +0.0%
5310.6: partial bitmap    1.96(2.43+0.14)   1.96(2.42+0.14) +0.0%



> > And in the general case we care not to have duplicate
> > find_pack_entry_one(*found_pack) calls. Worst what can happen is we can
> > call want_found_object(*found_pack) -- newly introduced helper for
> > checking whether we want object -- twice, but since want_found_object()
> > is very lightweight it does not make any difference.
> 
> I had trouble parsing this. I think maybe:
> 
>   In the general case we do not want to call find_pack_entry_one() more
>   than once, because it is expensive. This patch splits the loop in
>   want_object_in_pack() into two parts: finding the object and seeing if
>   it impacts our choice to include it in the pack. We may call the
>   inexpensive want_found_object() twice, but we will never call
>   find_pack_entry_one() if we do not need to.

Ok, thanks for the advice.

> 
> > +static int want_found_object(int exclude, struct packed_git *p)
> > +{
> > +	if (exclude)
> > +		return 1;
> > +	if (incremental)
> > +		return 0;
> > +
> > +	/*
> > +	 * When asked to do --local (do not include an object that appears in a
> > +	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
> > +	 * an object that appears in a pack marked with .keep), finding a pack
> > +	 * that matches the criteria is sufficient for us to decide to omit it.
> > +	 * However, even if this pack does not satisfy the criteria, we need to
> > +	 * make sure no copy of this object appears in _any_ pack that makes us
> > +	 * to omit the object, so we need to check all the packs. Signal that by
> > +	 * returning -1 to the caller.
> > +	 */
> > +	if (!ignore_packed_keep &&
> > +	    (!local || !have_non_local_packs))
> > +		return 1;
> 
> Hmm. The comment says "-1", but the return says "1". That is because the
> comment is describing the return that happens at the end. :)
> 
> I wonder if the last sentence should be:
> 
>   We can check here whether these options can possibly matter; if not,
>   we can return early from the function here. Otherwise, we signal "-1"
>   at the end to tell the caller that we do not know either way, and it
>   needs to check more packs.

Thanks for the catch and hint. I've changed it to the following:

	We can however first check whether these options can possibly matter;
	if they do not matter we know we want the object in generated pack.
	Otherwise, we signal "-1" at the end to tell the caller that we do
	not know either way, and it needs to check more packs.

full version:

        /*
         * When asked to do --local (do not include an object that appears in a
         * pack we borrow from elsewhere) or --honor-pack-keep (do not include
         * an object that appears in a pack marked with .keep), finding a pack
         * that matches the criteria is sufficient for us to decide to omit it.
         * However, even if this pack does not satisfy the criteria, we need to
         * make sure no copy of this object appears in _any_ pack that makes us
         * to omit the object, so we need to check all the packs.
         *
         * We can however first check whether these options can possibly matter;
         * if they do not matter we know we want the object in generated pack.
         * Otherwise, we signal "-1" at the end to tell the caller that we do
         * not know either way, and it needs to check more packs.
         */

Hope it is ok.

> > -	*found_pack = NULL;
> > -	*found_offset = 0;
> > +	/*
> > +	 * If we already know the pack object lives in, start checks from that
> > +	 * pack - in the usual case when neither --local was given nor .keep files
> > +	 * are present we will determine the answer right now.
> > +	 */
> > +	if (*found_pack) {
> > +		want = want_found_object(exclude, *found_pack);
> > +		if (want != -1)
> > +			return want;
> > +	}
> 
> Looks correct. Though it is not really "start checks from..." anymore,
> but rather "do a quick check to see if we can quit early, and otherwise
> start the loop". That might be nitpicking, though.

I see. Your version is ok, but to me 'start checks from ...' is a bit
more natural and self-explanatory (yes, all subjective and depending on
taste), so if possible I'd prefer to leave it as is.

> 
> >  	for (p = packed_git; p; p = p->next) {
> > -		off_t offset = find_pack_entry_one(sha1, p);
> > +		off_t offset;
> > +
> > +		if (p == *found_pack)
> > +			offset = *found_offset;
> > +		else
> > +			offset = find_pack_entry_one(sha1, p);
> > +
> 
> This hunk will conflict with the MRU optimizations in 'next', but I
> think the resolution should be pretty trivial.

Yes.

> >  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> >  			    const char *name, int exclude)
> >  {
> > -	struct packed_git *found_pack;
> > -	off_t found_offset;
> > +	struct packed_git *found_pack = NULL;
> > +	off_t found_offset = 0;
> 
> I think technically we don't need to initialize found_offset here (it is
> considered only if *found_pack is not NULL), but it doesn't hurt to make
> our starting assumptions clear.

Yes, found_pack != NULL is the indicator of whether we have found_pack /
found_offset info, but setting both found_{pack,offset} to a known
initial state makes it much clearer and guards against mistakes.

> > @@ -1073,6 +1097,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
> >  	if (have_duplicate_entry(sha1, 0, &index_pos))
> >  		return 0;
> >  
> > +	if (!want_object_in_pack(sha1, 0, &pack, &offset))
> > +		return 0;
> > +
> 
> And this caller doesn't need to worry about initialization, because of
> course it knows it has a pack/offset already. Good.

Yes, we have this info from bitmap walker calling us.


> > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> > index 3893afd..a278d30 100755
> > --- a/t/t5310-pack-bitmaps.sh
> > +++ b/t/t5310-pack-bitmaps.sh
> 
> Tests look OK. I saw a few style nitpicks, but I think they are not even
> against our style guide but more "I would have written it like this" and
> are not even worth quibbling over.
> 
> So I think the code here is fine, and I just had a few minor complaints
> on comment and commit message clarity.

Thanks for the feedback. Yes, tastes can differ, but your comments regarding
the commit message and want_found_object() were objectively (imho) worth it
and there I've made the adjustments.

Please expect the updated patch to be sent as a reply to this mail.

Thanks again for feedback,
Kirill


* Re: [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-08-18 18:06                                         ` Jeff King
@ 2016-09-10 14:59                                           ` Kirill Smelkov
  2016-09-10 15:01                                             ` [PATCH 2/2 v8] " Kirill Smelkov
  2016-09-12 19:21                                             ` [PATCH 2/2 v7] " Junio C Hamano
  0 siblings, 2 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-10 14:59 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Thu, Aug 18, 2016 at 02:06:15PM -0400, Jeff King wrote:
> On Tue, Aug 09, 2016 at 10:32:17PM +0300, Kirill Smelkov wrote:
> 
> > Subject: Re: [PATCH 2/2 v7] pack-objects: use reachability bitmap index when
> >    generating non-stdout pack
> 
> This is v7, but as I understand your numbering, it goes with v5 of patch
> 1/2 that I just reviewed (usually we just increment the version number
> on the whole series and treat it as a unit, even if some patches didn't
> change from version to version).

The reason those patches have their own numbers is that they are
orthogonal to each other and can be applied / rejected independently.
Since I thought Junio might want to pick them up as separate topics, they
were versioned separately.

But ok, since now we have them considered both together, their next
versions posted will be uniform v8.


> > So we can teach pack-objects to use bitmap index for initial object
> > counting phase when generating resultant pack file too:
> > 
> > - if we care it is not activated under git-repack:
> 
> Do you mean "if we take care that it is not..." here?
> 
> (I think you might just be getting tripped up in the English idioms;
> "care" means that we have a preference; "to take care" means that we are
> being careful).

Ok, I might have been tripped up, and thanks for the catch. I've changed it to

	"if we take care to not let it be activated under git-repack"

> 
> > - if we know bitmap index generation is not enabled for resultant pack:
> > 
> >   Current code has singleton bitmap_git so cannot work simultaneously
> >   with two bitmap indices.
> 
> Minor English fixes:
> 
>   The current code has a singleton bitmap_git, so it cannot work
>   simultaneously with two bitmap indices.

ok.

> > - if we keep pack reuse enabled still only for "send-to-stdout" case:
> > 
> >   Because on pack reuse raw entries are directly written out to destination
> >   pack by write_reused_pack() bypassing needed for pack index generation
> >   bookkeeping done by regular codepath in write_one() and friends.
> 
> Ditto on English:
> 
>   On pack reuse raw entries are directly written out to the destination
>   pack by write_reused_pack(), bypassing the need for pack index
>   generation bookkeeping done by the regular code path in write_one()
>   and friends.
> 
> I think this is missing the implication. Why wouldn't we want to reuse
> in this case? Certainly we don't when doing a "careful" on-disk repack.
> I suspect the answer is that we cannot write a ".idx" off of the result
> of write_reused_pack(), and write-to-disk always includes the .idx.

Yes, mentioning pack-to-file needs to generate .idx makes it more clear
and thanks for pointing this out. I've changed this item to the
following (picking some of your English corrections):

    - if we keep pack reuse enabled still only for "send-to-stdout" case:

      Because pack-to-file needs to generate an index for the destination
      pack, and currently on pack reuse raw entries are written directly to
      the destination pack by write_reused_pack(), bypassing the bookkeeping
      needed for pack index generation, which the regular codepath does in
      write_one() and friends.

      ( In the future we might teach pack-reuse code about cases when index
        also needs to be generated for resultant pack and remove
        pack-reuse-only-for-stdout limitation )

Hope it is ok.

> > More context:
> > 
> >     http://marc.info/?t=146792101400001&r=1&w=2
> 
> Can we turn this into a link to public-inbox? We have just been bit by
> all of our old links to gmane dying, and they cannot easily be replaced
> because they use a gmane-specific article number. public-inbox URLs use
> message-ids, which should be usable for other archives if public-inbox
> goes away.

Yes, makes sense to put msgid here. I've added

	http://public-inbox.org/git/20160707190917.20011-1-kirr@nexedi.com/T/#t


> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index b1007f2..c92d7fc 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> 
> The code here looks fine.

Thanks.

> > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> > index a278d30..9602e9a 100755
> > --- a/t/t5310-pack-bitmaps.sh
> > +++ b/t/t5310-pack-bitmaps.sh
> > @@ -196,6 +196,18 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
> >  	! has_any packbitmap.objects 3b.objects
> >  '
> >  
> > +test_expect_success 'pack-objects to file can use bitmap' '
> > +	# make sure we still have 1 bitmap index from previous tests
> > +	ls .git/objects/pack/ | grep bitmap >output &&
> > +	test_line_count = 1 output &&
> > +	# verify equivalent packs are generated with/without using bitmap index
> > +	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
> > +	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
> > +	list_packed_objects <packa-$packasha1.idx >packa.objects &&
> > +	list_packed_objects <packb-$packbsha1.idx >packb.objects &&
> > +	test_cmp packa.objects packb.objects
> > +'
> 
> Of course we can't know if bitmaps were actually used, or if they were
> turned off under the hood. But at least this exercises the code a bit.

Yes, I was thinking how to know the bitmap codepath was actually active,
and without adding debugging points there is no way (at least I could
not find it).

> You could possibly add a perf test which shows off the improvement, but
> I don't think it's strictly necessary.

Good idea. I've added this

---- 8< ----
diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index de2a224..bb91dbb 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -32,6 +32,14 @@ test_perf 'simulated fetch' '
        } | git pack-objects --revs --stdout >/dev/null
 '
 
+test_perf 'pack to file' '
+       git pack-objects --all pack1 </dev/null >/dev/null
+'
+
+test_perf 'pack to file (bitmap)' '
+       git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
+'
+
 test_expect_success 'create partial bitmap state' '
        # pick a commit to represent the repo tip in the past
        cutoff=$(git rev-list HEAD~100 -1) &&
@@ -53,8 +61,12 @@ test_expect_success 'create partial bitmap state' '
        git update-ref HEAD $orig_tip
 '
 
-test_perf 'partial bitmap' '
+test_perf 'clone (partial bitmap)' '
        git pack-objects --stdout --all </dev/null >/dev/null
 '
 
+test_perf 'pack to file (partial bitmap)' '
+       git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
+'
+
 test_done
---- 8< ----

    Test                                    56dfeb62          this tree
    --------------------------------------------------------------------------------
    5310.2: repack to disk                  8.98(8.05+0.29)   9.05(8.08+0.33) +0.8%
    5310.3: simulated clone                 2.02(2.27+0.09)   2.01(2.25+0.08) -0.5%
    5310.4: simulated fetch                 0.81(1.07+0.02)   0.81(1.05+0.04) +0.0%
    5310.5: pack to file                    7.58(7.04+0.28)   7.60(7.04+0.30) +0.3%
    5310.6: pack to file (bitmap)           7.55(7.02+0.28)   3.25(2.82+0.18) -57.0%
    5310.8: clone (partial bitmap)          1.83(2.26+0.12)   1.82(2.22+0.14) -0.5%
    5310.9: pack to file (partial bitmap)   6.86(6.58+0.30)   2.87(2.74+0.20) -58.2%
    

Kirill


* [PATCH 1/2 v8] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-09-10 14:57                                           ` Kirill Smelkov
@ 2016-09-10 15:01                                             ` Kirill Smelkov
  2016-09-13  6:23                                               ` Junio C Hamano
  2016-09-10 15:05                                             ` [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area Kirill Smelkov
  2016-09-12 17:33                                             ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Junio C Hamano
  2 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-10 15:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However, add_object_entry_from_bitmap(), unlike its non-bitmapped
counterpart add_object_entry(), does not check whether --local,
--honor-pack-keep or --incremental should be respected. In the
non-bitmapped codepath this is handled in want_object_in_pack(), but
the bitmapped codepath has no such checking at all.

The bitmapped codepath, however, accepted all those options while still
using bitmap indices under such conditions, potentially giving wrong
output (e.g. including objects from a non-local or .keep'ed pack).

We can easily fix this by noting the following: when an object comes to
add_object_entry_from_bitmap() it can come for two reasons:

    1. entries coming from main pack covered by bitmap index, and
    2. object coming from, possibly alternate, loose or other packs.

"2" can already be handled by want_object_in_pack(), and to cover
"1" we can teach want_object_in_pack() to expect that *found_pack can be
non-NULL, meaning the caller has already found the object's pack entry.

In want_object_in_pack() we take care to start the checks from the
already-found pack, if we have one, so that the answer is determined
right away when neither --local nor --honor-pack-keep is active. In
particular, as p5310-pack-bitmaps.sh shows (3 consecutive runs), this
does not hurt the performance of clones served with bitmaps:

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.08(8.20+0.25)   9.09(8.14+0.32) +0.1%
    5310.3: simulated clone   1.92(2.12+0.08)   1.93(2.12+0.09) +0.5%
    5310.4: simulated fetch   0.82(1.07+0.04)   0.82(1.06+0.04) +0.0%
    5310.6: partial bitmap    1.96(2.42+0.13)   1.95(2.40+0.15) -0.5%

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.11(8.16+0.32)   9.11(8.19+0.28) +0.0%
    5310.3: simulated clone   1.93(2.14+0.07)   1.92(2.11+0.10) -0.5%
    5310.4: simulated fetch   0.82(1.06+0.04)   0.82(1.04+0.05) +0.0%
    5310.6: partial bitmap    1.95(2.38+0.16)   1.94(2.39+0.14) -0.5%

    Test                      56dfeb62          this tree
    -----------------------------------------------------------------
    5310.2: repack to disk    9.13(8.17+0.31)   9.07(8.13+0.28) -0.7%
    5310.3: simulated clone   1.92(2.13+0.07)   1.91(2.12+0.06) -0.5%
    5310.4: simulated fetch   0.82(1.08+0.03)   0.82(1.08+0.03) +0.0%
    5310.6: partial bitmap    1.96(2.43+0.14)   1.96(2.42+0.14) +0.0%

with delta timings showing they are all within noise from run to run.

In the general case we do not want to call find_pack_entry_one() more than
once, because it is expensive. This patch splits the loop in
want_object_in_pack() into two parts: finding the object and seeing if it
impacts our choice to include it in the pack. We may call the inexpensive
want_found_object() twice, but we will never call find_pack_entry_one() if we
do not need to.
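
As a rough illustration of the tri-state answer described above, here is a
minimal standalone sketch. It is not the patch itself: struct pack_opts,
struct pack_info and want_found_object_sketch() are hypothetical stand-ins
for the pack-objects globals and struct packed_git, and only the decision
order is meant to follow want_found_object() in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the option globals in builtin/pack-objects.c. */
struct pack_opts {
	int incremental;          /* --incremental was given */
	int local;                /* --local was given */
	int ignore_packed_keep;   /* --honor-pack-keep was given */
	int have_non_local_packs; /* at least one pack comes from an alternate */
};

/* Hypothetical stand-in for the two struct packed_git fields used here. */
struct pack_info {
	int pack_local; /* pack lives in our own object store */
	int pack_keep;  /* pack is marked with a .keep file */
};

/*
 * Returns 1 (want the object), 0 (do not want it), or -1 (this pack alone
 * cannot decide, so the caller has to look at more packs).
 */
static int want_found_object_sketch(const struct pack_opts *o, int exclude,
				    const struct pack_info *p)
{
	assert(o != NULL && p != NULL);

	if (exclude)
		return 1;
	if (o->incremental)
		return 0;

	/* Neither --local nor --honor-pack-keep can matter: decide now. */
	if (!o->ignore_packed_keep &&
	    (!o->local || !o->have_non_local_packs))
		return 1;

	if (o->local && !p->pack_local)
		return 0;
	if (o->ignore_packed_keep && p->pack_local && p->pack_keep)
		return 0;

	/* We do not know yet; keep looking at other packs. */
	return -1;
}
```

With no options set the first pack examined already yields 1, which is why
the common bitmap-served clone case does not need any extra
find_pack_entry_one() calls.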

I appreciate the help of Junio C Hamano and Jeff King in discussing
this change.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c  | 97 ++++++++++++++++++++++++++++++++-----------------
 t/t5310-pack-bitmaps.sh | 92 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 156 insertions(+), 33 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c4c2a3c..19668d3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -944,13 +944,48 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
+static int want_found_object(int exclude, struct packed_git *p)
+{
+	if (exclude)
+		return 1;
+	if (incremental)
+		return 0;
+
+	/*
+	 * When asked to do --local (do not include an object that appears in a
+	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
+	 * an object that appears in a pack marked with .keep), finding a pack
+	 * that matches the criteria is sufficient for us to decide to omit it.
+	 * However, even if this pack does not satisfy the criteria, we need to
+	 * make sure no copy of this object appears in _any_ pack that makes us
+	 * to omit the object, so we need to check all the packs.
+	 *
+	 * We can however first check whether these options can possible matter;
+	 * if they do not matter we know we want the object in generated pack.
+	 * Otherwise, we signal "-1" at the end to tell the caller that we do
+	 * not know either way, and it needs to check more packs.
+	 */
+	if (!ignore_packed_keep &&
+	    (!local || !have_non_local_packs))
+		return 1;
+
+	if (local && !p->pack_local)
+		return 0;
+	if (ignore_packed_keep && p->pack_local && p->pack_keep)
+		return 0;
+
+	/* we don't know yet; keep looking for more packs */
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
- * As a side effect of this check, we will find the packed version of this
- * object, if any. We therefore pass out the pack information to avoid having
- * to look it up again later.
+ * If the caller already knows an existing pack it wants to take the object
+ * from, that is passed in *found_pack and *found_offset; otherwise this
+ * function finds if there is any pack that has the object and returns the pack
+ * and its offset in these variables.
  */
 static int want_object_in_pack(const unsigned char *sha1,
 			       int exclude,
@@ -958,15 +993,30 @@ static int want_object_in_pack(const unsigned char *sha1,
 			       off_t *found_offset)
 {
 	struct packed_git *p;
+	int want;
 
 	if (!exclude && local && has_loose_object_nonlocal(sha1))
 		return 0;
 
-	*found_pack = NULL;
-	*found_offset = 0;
+	/*
+	 * If we already know the pack object lives in, start checks from that
+	 * pack - in the usual case when neither --local was given nor .keep files
+	 * are present we will determine the answer right now.
+	 */
+	if (*found_pack) {
+		want = want_found_object(exclude, *found_pack);
+		if (want != -1)
+			return want;
+	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset;
+
+		if (p == *found_pack)
+			offset = *found_offset;
+		else
+			offset = find_pack_entry_one(sha1, p);
+
 		if (offset) {
 			if (!*found_pack) {
 				if (!is_pack_valid(p))
@@ -974,31 +1024,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			if (exclude)
-				return 1;
-			if (incremental)
-				return 0;
-
-			/*
-			 * When asked to do --local (do not include an
-			 * object that appears in a pack we borrow
-			 * from elsewhere) or --honor-pack-keep (do not
-			 * include an object that appears in a pack marked
-			 * with .keep), we need to make sure no copy of this
-			 * object come from in _any_ pack that causes us to
-			 * omit it, and need to complete this loop.  When
-			 * neither option is in effect, we know the object
-			 * we just found is going to be packed, so break
-			 * out of the loop to return 1 now.
-			 */
-			if (!ignore_packed_keep &&
-			    (!local || !have_non_local_packs))
-				break;
-
-			if (local && !p->pack_local)
-				return 0;
-			if (ignore_packed_keep && p->pack_local && p->pack_keep)
-				return 0;
+			want = want_found_object(exclude, p);
+			if (want != -1)
+				return want;
 		}
 	}
 
@@ -1039,8 +1067,8 @@ static const char no_closure_warning[] = N_(
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int exclude)
 {
-	struct packed_git *found_pack;
-	off_t found_offset;
+	struct packed_git *found_pack = NULL;
+	off_t found_offset = 0;
 	uint32_t index_pos;
 
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
@@ -1073,6 +1101,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
+	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+		return 0;
+
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
 
 	display_progress(progress_state, nr_result);
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..a278d30 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -7,6 +7,18 @@ objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
 
+# show objects present in pack ($1 should be associated *.idx)
+list_packed_objects () {
+	git show-index <"$1" | cut -d' ' -f2
+}
+
+# has_any pattern-file content-file
+# tests whether content-file has any entry from pattern-file with entries being
+# whole lines.
+has_any () {
+	grep -Ff "$1" "$2"
+}
+
 test_expect_success 'setup repo with moderate-sized history' '
 	for i in $(test_seq 1 10); do
 		test_commit $i
@@ -16,6 +28,7 @@ test_expect_success 'setup repo with moderate-sized history' '
 		test_commit side-$i
 	done &&
 	git checkout master &&
+	bitmaptip=$(git rev-parse master) &&
 	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
 	git tag tagged-blob $blob &&
 	git config repack.writebitmaps true &&
@@ -118,6 +131,71 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects respects --local (non-local loose)' '
+	git init --bare alt.git &&
+	echo $(pwd)/alt.git/objects >.git/objects/info/alternates &&
+	echo content1 >file1 &&
+	# non-local loose object which is not present in bitmapped pack
+	altblob=$(GIT_DIR=alt.git git hash-object -w file1) &&
+	# non-local loose object which is also present in bitmapped pack
+	git cat-file blob $blob | GIT_DIR=alt.git git hash-object -w --stdin &&
+	git add file1 &&
+	test_tick &&
+	git commit -m commit_file1 &&
+	echo HEAD | git pack-objects --local --stdout --revs >1.pack &&
+	git index-pack 1.pack &&
+	list_packed_objects 1.idx >1.objects &&
+	printf "%s\n" "$altblob" "$blob" >nonlocal-loose &&
+	! has_any nonlocal-loose 1.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local non-bitmapped pack)' '
+	echo content2 >file2 &&
+	blob2=$(git hash-object -w file2) &&
+	git add file2 &&
+	test_tick &&
+	git commit -m commit_file2 &&
+	printf "%s\n" "$blob2" "$bitmaptip" >keepobjects &&
+	pack2=$(git pack-objects pack2 <keepobjects) &&
+	mv pack2-$pack2.* .git/objects/pack/ &&
+	>.git/objects/pack/pack2-$pack2.keep &&
+	rm $(objpath $blob2) &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >2a.pack &&
+	git index-pack 2a.pack &&
+	list_packed_objects 2a.idx >2a.objects &&
+	! has_any keepobjects 2a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local pack)' '
+	mv .git/objects/pack/pack2-$pack2.* alt.git/objects/pack/ &&
+	echo HEAD | git pack-objects --local --stdout --revs >2b.pack &&
+	git index-pack 2b.pack &&
+	list_packed_objects 2b.idx >2b.objects &&
+	! has_any keepobjects 2b.objects
+'
+
+test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pack)' '
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	packbitmap=$(basename $(cat output) .bitmap) &&
+	list_packed_objects .git/objects/pack/$packbitmap.idx >packbitmap.objects &&
+	test_when_finished "rm -f .git/objects/pack/$packbitmap.keep" &&
+	>.git/objects/pack/$packbitmap.keep &&
+	echo HEAD | git pack-objects --honor-pack-keep --stdout --revs >3a.pack &&
+	git index-pack 3a.pack &&
+	list_packed_objects 3a.idx >3a.objects &&
+	! has_any packbitmap.objects 3a.objects
+'
+
+test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
+	mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+	test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" &&
+	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
+	git index-pack 3b.pack &&
+	list_packed_objects 3b.idx >3b.objects &&
+	! has_any packbitmap.objects 3b.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
@@ -143,6 +221,20 @@ test_expect_success 'create objects for missing-HAVE tests' '
 	EOF
 '
 
+test_expect_success 'pack-objects respects --incremental' '
+	cat >revs2 <<-EOF &&
+	HEAD
+	$commit
+	EOF
+	git pack-objects --incremental --stdout --revs <revs2 >4.pack &&
+	git index-pack 4.pack &&
+	list_packed_objects 4.idx >4.objects &&
+	test_line_count = 4 4.objects &&
+	git rev-list --objects $commit >revlist &&
+	cut -d" " -f1 revlist |sort >objects &&
+	test_cmp 4.objects objects
+'
+
 test_expect_success 'pack with missing blob' '
 	rm $(objpath $blob) &&
 	git pack-objects --stdout --revs <revs >/dev/null
-- 
2.9.2.701.gf965a18.dirty


* [PATCH 2/2 v8] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-09-10 14:59                                           ` Kirill Smelkov
@ 2016-09-10 15:01                                             ` Kirill Smelkov
  2016-09-12 19:21                                             ` [PATCH 2/2 v7] " Junio C Hamano
  1 sibling, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-10 15:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects),
if a repository has a bitmap index, pack-objects can nicely speed up the
"Counting objects" graph traversal phase. That, however, was done only for
the case when the resultant pack is sent to stdout, not written into a file.

The reason is that for on-disk repack we want, by default:

- to produce good pack (with bitmap index not-yet-packed objects are
  emitted to pack in suboptimal order).

- to use more robust pack-generation codepath (avoiding possible
  bugs in bitmap code and possible bitmap index corruption).

Jeff King further explains:

    The reason for this split is that pack-objects tries to determine how
    "careful" it should be based on whether we are packing to disk or to
    stdout. Packing to disk implies "git repack", and that we will likely
    delete the old packs after finishing. We want to be more careful (so
    as not to carry forward a corruption, and to generate a more optimal
    pack), and we presumably run less frequently and can afford extra CPU.
    Whereas packing to stdout implies serving a remote via "git fetch" or
    "git push". This happens more frequently (e.g., a server handling many
    fetching clients), and we assume the receiving end takes more
    responsibility for verifying the data.

    But this isn't always the case. One might want to generate on-disk
    packfiles for a specialized object transfer. Just using "--stdout" and
    writing to a file is not optimal, as it will not generate the matching
    pack index.

    So it would be useful to have some way of overriding this heuristic:
    to tell pack-objects that even though it should generate on-disk
    files, it is still OK to use the reachability bitmaps to do the
    traversal.

So we can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we take care to not let it be activated under git-repack:

  See above about repack robustness and not forward-carrying corruption.

- if we know bitmap index generation is not enabled for resultant pack:

  The current code has singleton bitmap_git, so it cannot work
  simultaneously with two bitmap indices.

  We also want to avoid (at least with current implementation)
  generating bitmaps off of bitmaps. The reason here is: when generating
  a pack, not-yet-packed objects will be emitted into pack in
  suboptimal order and added to tail of the bitmap as "extended entries".
  When the resultant pack + some new objects in associated repository
  are in turn used to generate another pack with bitmap, the situation
  repeats: new objects are again not emitted optimally and just added to
  bitmap tail - not in recency order.

  So the pack badness can grow over time when at each step we have
  bitmapped pack + some other objects. That's why we want to avoid
  generating bitmaps off of bitmaps, not to let pack badness grow.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because pack-to-file needs to generate an index for the destination
  pack, and currently on pack reuse raw entries are written directly to
  the destination pack by write_reused_pack(), bypassing the bookkeeping
  needed for pack index generation, which the regular codepath does in
  write_one() and friends.

  ( In the future we might teach pack-reuse code about cases when index
    also needs to be generated for resultant pack and remove
    pack-reuse-only-for-stdout limitation )

This way for pack-objects -> file we get nice speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff also suggests that pack.useBitmaps was probably a mistake to
introduce originally. This way we are not adding another config point,
but instead just always default to-file pack-objects not to use bitmap
index: tools that need to generate on-disk packs using bitmaps can
pass --use-bitmap-index explicitly. And git-repack never passes
--use-bitmap-index, so this way we can be sure regular on-disk repacking
remains robust.
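
To make the resulting precedence easier to follow, here is a small
standalone sketch of how the final setting could be resolved. It is only
an illustration under assumptions: struct bitmap_inputs and
resolve_use_bitmap_index() are hypothetical names that do not exist in
git, and the checks are meant to mirror the cmd_pack_objects() hunk in
the diff below.

```c
#include <assert.h>

/* Hypothetical bundle of the inputs that cmd_pack_objects() consults. */
struct bitmap_inputs {
	int cmdline;               /* -1 unset, 0 --no-use-bitmap-index, 1 --use-bitmap-index */
	int config_default;        /* pack.useBitmaps; 1 when not configured */
	int pack_to_stdout;        /* --stdout was given */
	int use_internal_rev_list; /* --revs style traversal is in use */
	int write_bitmap_index;    /* a bitmap is being written for the new pack */
	int shallow;               /* repository is shallow */
};

static int resolve_use_bitmap_index(const struct bitmap_inputs *in)
{
	int def = in->config_default;
	int use = in->cmdline;

	assert(use >= -1 && use <= 1);

	/* "soft" reason: on-disk packing defaults to not using bitmaps */
	if (!in->pack_to_stdout)
		def = 0;

	/* an explicit command-line choice wins over the default */
	if (use < 0)
		use = def;

	/* "hard" reasons: bitmaps cannot work at all in these cases */
	if (!in->use_internal_rev_list ||
	    (!in->pack_to_stdout && in->write_bitmap_index) ||
	    in->shallow)
		use = 0;

	return use;
}
```

In this sketch a to-file run only ends up using bitmaps when
--use-bitmap-index is passed explicitly and no bitmap is being written
for the new pack, which matches the git-repack behaviour described above.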

NOTE2

`git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower
than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

    $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

    real    0m22.309s
    user    0m21.148s
    sys     0m0.932s

    $ time git index-pack erp5pack-stdout.pack

    real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
    user    0m49.300s
    sys     0m1.360s

So the time for

    `pack-objects --stdout >file.pack` + `index-pack file.pack` is  72s,

while

    `pack-objects file.pack` which does both pack and index     is  27s.

And even

    `pack-objects --no-use-bitmap-index file.pack`              is  37s.

Jeff explains:

    The packfile does not carry the sha1 of the objects. A receiving
    index-pack has to compute them itself, including inflating and applying
    all of the deltas.

that's why for `git-backup restore` we want to teach `git pack-objects
file.pack` to use bitmaps instead of using `git pack-objects --stdout
>file.pack` + `git index-pack file.pack`.

NOTE3

The speedup is now tracked via t/perf/p5310-pack-bitmaps.sh

    Test                                    56dfeb62          this tree
    --------------------------------------------------------------------------------
    5310.2: repack to disk                  8.98(8.05+0.29)   9.05(8.08+0.33) +0.8%
    5310.3: simulated clone                 2.02(2.27+0.09)   2.01(2.25+0.08) -0.5%
    5310.4: simulated fetch                 0.81(1.07+0.02)   0.81(1.05+0.04) +0.0%
    5310.5: pack to file                    7.58(7.04+0.28)   7.60(7.04+0.30) +0.3%
    5310.6: pack to file (bitmap)           7.55(7.02+0.28)   3.25(2.82+0.18) -57.0%
    5310.8: clone (partial bitmap)          1.83(2.26+0.12)   1.82(2.22+0.14) -0.5%
    5310.9: pack to file (partial bitmap)   6.86(6.58+0.30)   2.87(2.74+0.20) -58.2%

More context:

    http://marc.info/?t=146792101400001&r=1&w=2
    http://public-inbox.org/git/20160707190917.20011-1-kirr@nexedi.com/T/#t

Cc: Vicent Marti <tanoku@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c       | 31 ++++++++++++++++++++++++-------
 t/perf/p5310-pack-bitmaps.sh | 14 +++++++++++++-
 t/t5310-pack-bitmaps.sh      | 12 ++++++++++++
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 19668d3..d48c290 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -67,7 +67,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_index_default = 1;
+static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2274,7 +2275,7 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2523,13 +2524,13 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 }
 
 /*
- * This tracks any options which a reader of the pack might
- * not understand, and which would therefore prevent blind reuse
- * of what we have on disk.
+ * This tracks any options which pack-reuse code expects to be on, or which a
+ * reader of the pack might not understand, and which would therefore prevent
+ * blind reuse of what we have on disk.
  */
 static int pack_options_allow_reuse(void)
 {
-	return allow_ofs_delta;
+	return pack_to_stdout && allow_ofs_delta;
 }
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
@@ -2822,7 +2823,23 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	/*
+	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
+	 *
+	 * - to produce good pack (with bitmap index not-yet-packed objects are
+	 *   packed in suboptimal order).
+	 *
+	 * - to use more robust pack-generation codepath (avoiding possible
+	 *   bugs in bitmap code and possible bitmap index corruption).
+	 */
+	if (!pack_to_stdout)
+		use_bitmap_index_default = 0;
+
+	if (use_bitmap_index < 0)
+		use_bitmap_index = use_bitmap_index_default;
+
+	/* "hard" reasons not to use bitmaps; these just won't work at all */
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index de2a224..bb91dbb 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -32,6 +32,14 @@ test_perf 'simulated fetch' '
 	} | git pack-objects --revs --stdout >/dev/null
 '
 
+test_perf 'pack to file' '
+	git pack-objects --all pack1 </dev/null >/dev/null
+'
+
+test_perf 'pack to file (bitmap)' '
+	git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
+'
+
 test_expect_success 'create partial bitmap state' '
 	# pick a commit to represent the repo tip in the past
 	cutoff=$(git rev-list HEAD~100 -1) &&
@@ -53,8 +61,12 @@ test_expect_success 'create partial bitmap state' '
 	git update-ref HEAD $orig_tip
 '
 
-test_perf 'partial bitmap' '
+test_perf 'clone (partial bitmap)' '
 	git pack-objects --stdout --all </dev/null >/dev/null
 '
 
+test_perf 'pack to file (partial bitmap)' '
+	git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
+'
+
 test_done
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a278d30..9602e9a 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -196,6 +196,18 @@ test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	! has_any packbitmap.objects 3b.objects
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	list_packed_objects <packa-$packasha1.idx >packa.objects &&
+	list_packed_objects <packb-$packbsha1.idx >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.2.701.gf965a18.dirty


* [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area
  2016-09-10 14:57                                           ` Kirill Smelkov
  2016-09-10 15:01                                             ` [PATCH 1/2 v8] " Kirill Smelkov
@ 2016-09-10 15:05                                             ` Kirill Smelkov
  2016-09-12 19:12                                               ` Junio C Hamano
  2016-09-12 17:33                                             ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Junio C Hamano
  2 siblings, 1 reply; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-10 15:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git, Kirill Smelkov

Otherwise, for people who use autotools-based configure in the main
worktree, the performance testing results will be inconsistent, as the work
and build trees could be using e.g. different optimization levels.

See e.g.

	http://public-inbox.org/git/20160818175222.bmm3ivjheokf2qzl@sigill.intra.peff.net/

for example.

NOTE config.status has to be copied because otherwise without it the build
would want to run reconfigure this way loosing just copied config.mak.autogen.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 ( Resending as separate patch-mail, just in case )

 t/perf/run | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/perf/run b/t/perf/run
index cfd7012..aa383c2 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -30,7 +30,7 @@ unpack_git_rev () {
 }
 build_git_rev () {
 	rev=$1
-	cp ../../config.mak build/$rev/config.mak
+	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
 	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
 	die "failed to build revision '$mydir'"
 }
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-09-10 14:57                                           ` Kirill Smelkov
  2016-09-10 15:01                                             ` [PATCH 1/2 v8] " Kirill Smelkov
  2016-09-10 15:05                                             ` [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area Kirill Smelkov
@ 2016-09-12 17:33                                             ` Junio C Hamano
  2 siblings, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2016-09-12 17:33 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> On Thu, Aug 18, 2016 at 01:52:22PM -0400, Jeff King wrote:
> > 
> > Good to know there is no regression. It is curious that there is a
> > slight _improvement_ across the board. Do we have an explanation for
> > that? It seems odd that noise would be so consistent.
>
> Yes, I wondered about that too, and it turned out that t/perf/run does
> not copy config.mak.autogen & friends to build/ and I'm using autoconf with
> CFLAGS="-march=native -O3 ..."
>
> Junio, I could not resist the following:
> ...
> With corrected t/perf/run the timings are more realistic - e.g. 3
> consecutive runs of `./run 56dfeb62 . ./p5310-pack-bitmaps.sh`:

Wow, that's what I call an exchange with quality during a review ;-)

Thanks for the curiosity and for digging down to the root cause of the
anomaly.  Some GNUism/bashism in the way copying is spelled in the
patch bothers me, but that is easily fixable.

Thanks.


* Re: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area
  2016-09-10 15:05                                             ` [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area Kirill Smelkov
@ 2016-09-12 19:12                                               ` Junio C Hamano
  2016-09-12 19:17                                                 ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-09-12 19:12 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> Otherwise, for people who use the autotools-based configure in the main
> worktree, the performance testing results will be inconsistent, as the work
> and build trees could be using e.g. different optimization levels.
>
> See
>
> 	http://public-inbox.org/git/20160818175222.bmm3ivjheokf2qzl@sigill.intra.peff.net/
>
> for an example.
>
> NOTE: config.status has to be copied too, because without it the build
> would rerun configure and thereby lose the just-copied config.mak.autogen.
>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
> ---
>  ( Resending as separate patch-mail, just in case )
>
>  t/perf/run | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/t/perf/run b/t/perf/run
> index cfd7012..aa383c2 100755
> --- a/t/perf/run
> +++ b/t/perf/run
> @@ -30,7 +30,7 @@ unpack_git_rev () {
>  }
>  build_git_rev () {
>  	rev=$1
> -	cp ../../config.mak build/$rev/config.mak
> +	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}

That unfortunately is a GNUism -t with a bash-ism {a,b,c}; just keep
it simple and stupid to make sure it is portable.

This is not even a part that we measure the runtime for anyway.

>  	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
>  	die "failed to build revision '$mydir'"
>  }


* Re: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area
  2016-09-12 19:12                                               ` Junio C Hamano
@ 2016-09-12 19:17                                                 ` Junio C Hamano
  2016-09-12 23:10                                                   ` Junio C Hamano
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-09-12 19:17 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Junio C Hamano <gitster@pobox.com> writes:

>>  build_git_rev () {
>>  	rev=$1
>> -	cp ../../config.mak build/$rev/config.mak
>> +	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
>
> That unfortunately is a GNUism -t with a bash-ism {a,b,c}; just keep
> it simple and stupid to make sure it is portable.
>
> This is not even a part that we measure the runtime for anyway.

In other words, something along this line, perhaps.

 t/perf/run | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/t/perf/run b/t/perf/run
index aa383c2..69a4714 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -30,7 +30,10 @@ unpack_git_rev () {
 }
 build_git_rev () {
 	rev=$1
-	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
+	for config in config.mak config.mak.autogen config.status
+	do
+		cp "../../$config" "build/$rev/"
+	done
 	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
 	die "failed to build revision '$mydir'"
 }
-- 
2.10.0-342-gc678130



* Re: [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack
  2016-09-10 14:59                                           ` Kirill Smelkov
  2016-09-10 15:01                                             ` [PATCH 2/2 v8] " Kirill Smelkov
@ 2016-09-12 19:21                                             ` Junio C Hamano
  1 sibling, 0 replies; 62+ messages in thread
From: Junio C Hamano @ 2016-09-12 19:21 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

>> This is v7, but as I understand your numbering, it goes with v5 of patch
>> 1/2 that I just reviewed (usually we just increment the version number
>> on the whole series and treat it as a unit, even if some patches didn't
>> change from version to version).
>
> The reason those patches have their own numbers is that they are
> orthogonal to each other and can be applied / rejected independently.

In such a case, we wouldn't label them 1/2 and 2/2, which tells the
readers that these are two pieces that are to be applied together to
form a single unit of change.  That was what made these numbered
patches with different version numbers confusing.

> But ok, since now we have them considered both together, their next
> versions posted will be uniform v8.

OK.  Thanks for clarifying.


* Re: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area
  2016-09-12 19:17                                                 ` Junio C Hamano
@ 2016-09-12 23:10                                                   ` Junio C Hamano
  2016-09-13  6:58                                                     ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-09-12 23:10 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Junio C Hamano <gitster@pobox.com> writes:

> In other words, something along this line, perhaps.
> ...

Not quite.  There is no guarantee that the user is using autoconf at
all.  It should be more like this, I think.

 t/perf/run | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/t/perf/run b/t/perf/run
index aa383c2..7ec3734 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -30,7 +30,13 @@ unpack_git_rev () {
 }
 build_git_rev () {
 	rev=$1
-	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
+	for config in config.mak config.mak.autogen config.status
+	do
+		if test -f "../../$config"
+		then
+			cp "../../$config" "build/$rev/"
+		fi
+	done
 	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
 	die "failed to build revision '$mydir'"
 }
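
To see the guarded loop skip files that do not exist, it can be tried in
a scratch directory.  This is only a sketch: the function name
copy_build_configs and the directory layout are made up here and are not
part of t/perf/run.

```shell
# Scratch layout standing in for the git worktree and build/$rev;
# the directory names and copy_build_configs are hypothetical.
top=${TMPDIR:-/tmp}/perf-copy-demo.$$
mkdir -p "$top/src" "$top/build"
echo 'CFLAGS = -O3' >"$top/src/config.mak"
echo '# generated' >"$top/src/config.status"
# config.mak.autogen is deliberately absent (as if autoconf were not in use)

copy_build_configs () {
	# $1 = source directory, $2 = destination directory
	for config in config.mak config.mak.autogen config.status
	do
		if test -f "$1/$config"
		then
			cp "$1/$config" "$2/" || return 1
		fi
	done
}

copy_build_configs "$top/src" "$top/build"
# Lists config.mak and config.status only; cp is never run for the
# missing config.mak.autogen, so nothing is printed to stderr.
ls "$top/build"
```

Only plain "cp SRC DIR/" and a for loop are used, so this should behave the
same in any POSIX shell; remove "$top" afterwards.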




* Re: [PATCH 1/2 v8] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-09-10 15:01                                             ` [PATCH 1/2 v8] " Kirill Smelkov
@ 2016-09-13  6:23                                               ` Junio C Hamano
  2016-09-13  7:50                                                 ` Kirill Smelkov
  0 siblings, 1 reply; 62+ messages in thread
From: Junio C Hamano @ 2016-09-13  6:23 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

Kirill Smelkov <kirr@nexedi.com> writes:

> +static int want_found_object(int exclude, struct packed_git *p)
> +{
> +	if (exclude)
> +		return 1;
> +	if (incremental)
> +		return 0;
> +
> +	/*
> +	 * When asked to do --local (do not include an object that appears in a
> +	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
> +	 * an object that appears in a pack marked with .keep), finding a pack
> +	 * that matches the criteria is sufficient for us to decide to omit it.
> +	 * However, even if this pack does not satisfy the criteria, we need to
> +	 * make sure no copy of this object appears in _any_ pack that makes us
> +	 * omit the object, so we need to check all the packs.
> +	 *
> +	 * We can however first check whether these options can possibly matter;
> +	 * if they do not matter we know we want the object in generated pack.
> +	 * Otherwise, we signal "-1" at the end to tell the caller that we do
> +	 * not know either way, and it needs to check more packs.
> +	 */
> +	if (!ignore_packed_keep &&
> +	    (!local || !have_non_local_packs))
> +		return 1;
> +
> +	if (local && !p->pack_local)
> +		return 0;
> +	if (ignore_packed_keep && p->pack_local && p->pack_keep)
> +		return 0;
> +
> +	/* we don't know yet; keep looking for more packs */
> +	return -1;
> +}

Moving this logic out to this helper made the main logic in the
caller easier to grasp.

> @@ -958,15 +993,30 @@ static int want_object_in_pack(const unsigned char *sha1,
>  			       off_t *found_offset)
>  {
>  	struct packed_git *p;
> +	int want;
>  
>  	if (!exclude && local && has_loose_object_nonlocal(sha1))
>  		return 0;
>  
> +	/*
> +	 * If we already know the pack the object lives in, start checks from that
> +	 * pack - in the usual case, when neither --local was given nor .keep files
> +	 * are present, we can determine the answer right away.
> +	 */
> +	if (*found_pack) {
> +		want = want_found_object(exclude, *found_pack);
> +		if (want != -1)
> +			return want;
> +	}
>  
>  	for (p = packed_git; p; p = p->next) {
> +		off_t offset;
> +
> +		if (p == *found_pack)
> +			offset = *found_offset;
> +		else
> +			offset = find_pack_entry_one(sha1, p);
> +
>  		if (offset) {
>  			if (!*found_pack) {
>  				if (!is_pack_valid(p))
> @@ -974,31 +1024,9 @@ static int want_object_in_pack(const unsigned char *sha1,
>  				*found_offset = offset;
>  				*found_pack = p;
>  			}
> +			want = want_found_object(exclude, p);
> +			if (want != -1)
> +				return want;
>  		}
>  	}

As Peff noted in his earlier review, however, MRU code needed to be
grafted in to the caller (an update to the MRU list was done in the
code that was moved to the want_found_object() helper).  I think I
did it correctly, which ended up looking like this:

                want = want_found_object(exclude, p);
                if (!exclude && want > 0)
                        mru_mark(packed_git_mru, entry);
                if (want != -1)
                        return want;

I somewhat feel that it is ugly that the helper knows about exclude
(i.e. in the original code, we immediately returned 1 without
futzing with the MRU when we find an entry that is to be excluded,
which now is done in the helper), and the caller also knows about
exclude (i.e. the caller knows that the helper may return positive
in two cases, it knows that MRU marking needs to happen only one of
the two cases, and it also knows that "exclude" is what
differentiates between the two cases) at the same time.

But probably the reason I find it ugly is only that I knew what the
original looked like.  I dunno.





* Re: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area
  2016-09-12 23:10                                                   ` Junio C Hamano
@ 2016-09-13  6:58                                                     ` Kirill Smelkov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-13  6:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Sep 12, 2016 at 04:10:09PM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > In other words, something along this line, perhaps.
> > ...
> 
> Not quite.  There is no guarantee that the user is using autoconf at
> all.  It should be more like this, I think.
> 
>  t/perf/run | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/t/perf/run b/t/perf/run
> index aa383c2..7ec3734 100755
> --- a/t/perf/run
> +++ b/t/perf/run
> @@ -30,7 +30,13 @@ unpack_git_rev () {
>  }
>  build_git_rev () {
>  	rev=$1
> -	cp -t build/$rev ../../{config.mak,config.mak.autogen,config.status}
> +	for config in config.mak config.mak.autogen config.status
> +	do
> +		if test -f "../../$config"
> +		then
> +			cp "../../$config" "build/$rev/"
> +		fi
> +	done
>  	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
>  	die "failed to build revision '$mydir'"
>  }

Junio, thanks for the encouraging feedback and for catching the *-isms. What
you propose is good (and it also automatically fixes the case where there
was no config.mak: that worked, but cp printed an error to stderr while the
script continued normally).

I would amend your squash the following way:

* `test -f` -> `test -e`, because -f tests whether a file exists _and_
  is a regular file. Some people might have config.mak as a symlink, for
  example. We don't want to miss them either.
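
A quick way to check how the two tests treat these cases is to try them in
a scratch directory. This is a sketch only; all file names and the probe
helper are hypothetical. Note that in a POSIX shell both -f and -e resolve
symlinks, so a symlink to a regular file passes either test; the two differ
for things that exist but are not regular files, such as a directory.

```shell
# Scratch directory; every name below is made up for the demonstration.
tmp=${TMPDIR:-/tmp}/test-f-vs-e.$$
mkdir -p "$tmp/dir.mak"                  # exists, but is not a regular file
echo 'CFLAGS = -O2' >"$tmp/real.mak"     # plain regular file
ln -s real.mak "$tmp/link.mak"           # symlink to a regular file
ln -s missing.mak "$tmp/dangling.mak"    # symlink whose target does not exist

probe () {
	# Print "f" and/or "e" for each test that succeeds on $1,
	# or "none" if neither does.
	r=
	if test -f "$1"; then r="${r}f"; fi
	if test -e "$1"; then r="${r}e"; fi
	echo "${r:-none}"
}

probe "$tmp/real.mak"       # fe
probe "$tmp/link.mak"       # fe (both tests follow the symlink)
probe "$tmp/dangling.mak"   # none
probe "$tmp/dir.mak"        # e
```

So switching to -e does not change anything for a symlinked config.mak that
points at a regular file; it only widens the check to non-regular files.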

Please find updated patch below:

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Subject: [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends
 to build area

Otherwise, for people who use the autotools-based configure in the main
worktree, the performance testing results will be inconsistent, as the work
and build trees could be using e.g. different optimization levels.

See

	http://public-inbox.org/git/20160818175222.bmm3ivjheokf2qzl@sigill.intra.peff.net/

for an example.

NOTE: config.status has to be copied too, because without it the build
would rerun configure and thereby lose the just-copied config.mak.autogen.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 t/perf/run | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/t/perf/run b/t/perf/run
index cfd7012..e8adeda 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -30,7 +30,13 @@ unpack_git_rev () {
 }
 build_git_rev () {
 	rev=$1
-	cp ../../config.mak build/$rev/config.mak
+	for config in config.mak config.mak.autogen config.status
+	do
+		if test -e "../../$config"
+		then
+			cp "../../$config" "build/$rev/"
+		fi
+	done
 	(cd build/$rev && make $GIT_PERF_MAKE_OPTS) ||
 	die "failed to build revision '$mydir'"
 }
-- 
2.9.2.701.gf965a18.dirty


* Re: [PATCH 1/2 v8] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use
  2016-09-13  6:23                                               ` Junio C Hamano
@ 2016-09-13  7:50                                                 ` Kirill Smelkov
  0 siblings, 0 replies; 62+ messages in thread
From: Kirill Smelkov @ 2016-09-13  7:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Vicent Marti, Jérome Perrin, Isabelle Vallet,
	Kazuhiko Shiozaki, Julien Muchembled, git

On Mon, Sep 12, 2016 at 11:23:18PM -0700, Junio C Hamano wrote:
> Kirill Smelkov <kirr@nexedi.com> writes:
> 
> > +static int want_found_object(int exclude, struct packed_git *p)
> > +{
> > +	if (exclude)
> > +		return 1;
> > +	if (incremental)
> > +		return 0;
> > +
> > +	/*
> > +	 * When asked to do --local (do not include an object that appears in a
> > +	 * pack we borrow from elsewhere) or --honor-pack-keep (do not include
> > +	 * an object that appears in a pack marked with .keep), finding a pack
> > +	 * that matches the criteria is sufficient for us to decide to omit it.
> > +	 * However, even if this pack does not satisfy the criteria, we need to
> > +	 * make sure no copy of this object appears in _any_ pack that makes us
> > +	 * omit the object, so we need to check all the packs.
> > +	 *
> > +	 * We can however first check whether these options can possibly matter;
> > +	 * if they do not matter we know we want the object in generated pack.
> > +	 * Otherwise, we signal "-1" at the end to tell the caller that we do
> > +	 * not know either way, and it needs to check more packs.
> > +	 */
> > +	if (!ignore_packed_keep &&
> > +	    (!local || !have_non_local_packs))
> > +		return 1;
> > +
> > +	if (local && !p->pack_local)
> > +		return 0;
> > +	if (ignore_packed_keep && p->pack_local && p->pack_keep)
> > +		return 0;
> > +
> > +	/* we don't know yet; keep looking for more packs */
> > +	return -1;
> > +}
> 
> Moving this logic out to this helper made the main logic in the
> caller easier to grasp.
> 
> > @@ -958,15 +993,30 @@ static int want_object_in_pack(const unsigned char *sha1,
> >  			       off_t *found_offset)
> >  {
> >  	struct packed_git *p;
> > +	int want;
> >  
> >  	if (!exclude && local && has_loose_object_nonlocal(sha1))
> >  		return 0;
> >  
> > +	/*
> > +	 * If we already know the pack the object lives in, start checks from that
> > +	 * pack - in the usual case, when neither --local was given nor .keep files
> > +	 * are present, we can determine the answer right away.
> > +	 */
> > +	if (*found_pack) {
> > +		want = want_found_object(exclude, *found_pack);
> > +		if (want != -1)
> > +			return want;
> > +	}
> >  
> >  	for (p = packed_git; p; p = p->next) {
> > +		off_t offset;
> > +
> > +		if (p == *found_pack)
> > +			offset = *found_offset;
> > +		else
> > +			offset = find_pack_entry_one(sha1, p);
> > +
> >  		if (offset) {
> >  			if (!*found_pack) {
> >  				if (!is_pack_valid(p))
> > @@ -974,31 +1024,9 @@ static int want_object_in_pack(const unsigned char *sha1,
> >  				*found_offset = offset;
> >  				*found_pack = p;
> >  			}
> > +			want = want_found_object(exclude, p);
> > +			if (want != -1)
> > +				return want;
> >  		}
> >  	}
> 
> As Peff noted in his earlier review, however, MRU code needed to be
> grafted in to the caller (an update to the MRU list was done in the
> code that was moved to the want_found_object() helper).  I think I
> did it correctly, which ended up looking like this:
> 
>                 want = want_found_object(exclude, p);
>                 if (!exclude && want > 0)
>                         mru_mark(packed_git_mru, entry);
>                 if (want != -1)
>                         return want;
> 
> I somewhat feel that it is ugly that the helper knows about exclude
> (i.e. in the original code, we immediately returned 1 without
> futzing with the MRU when we find an entry that is to be excluded,
> which now is done in the helper), and the caller also knows about
> exclude (i.e. the caller knows that the helper may return positive
> in two cases, it knows that MRU marking needs to happen only one of
> the two cases, and it also knows that "exclude" is what
> differentiates between the two cases) at the same time.
> 
> But probably the reason I find it ugly is only that I knew what the
> original looked like.  I dunno.

Junio, the code above is a correct semantic merge of pack-mru and my
topic, because in pack-mru, if the object was found and exclude=1, 1 was
returned without marking the found pack.

But I wonder: even if we exclude an object, we were still looking for it
in packs, and when we found it, we found the corresponding pack too. So,
that pack _was_ most-recently-used, and it is correct to mark it as MRU.

We can do the simplification in a follow-up patch after the merge, so
that the merge itself does not change semantics and everything stays
bisectable.

Jeff?


Thread overview: 62+ messages
2016-07-07 19:09 [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
2016-07-07 20:52 ` Jeff King
2016-07-08 10:38   ` Kirill Smelkov
2016-07-12 19:08     ` Kirill Smelkov
2016-07-13  8:30       ` Jeff King
2016-07-13  8:26     ` Jeff King
2016-07-13 10:52       ` Kirill Smelkov
2016-07-17 17:06         ` Kirill Smelkov
2016-07-19 11:29           ` Jeff King
2016-07-19 12:14             ` Kirill Smelkov
2016-07-25 18:40         ` Jeff King
2016-07-25 18:53           ` Jeff King
2016-07-27 20:15           ` Kirill Smelkov
2016-07-27 20:40             ` Junio C Hamano
2016-07-28 20:22               ` Kirill Smelkov
2016-07-28 21:18                 ` Junio C Hamano
2016-07-29  7:40                   ` Kirill Smelkov
2016-07-29  7:46                     ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Kirill Smelkov
2016-08-01 18:17                       ` Junio C Hamano
2016-08-08 12:37                         ` Kirill Smelkov
2016-08-08 13:50                           ` Jeff King
2016-08-08 13:51                             ` Jeff King
2016-08-08 16:08                             ` Junio C Hamano
2016-08-08 19:06                             ` Junio C Hamano
2016-08-08 19:09                               ` Jeff King
2016-08-08 16:11                           ` Junio C Hamano
2016-08-08 18:19                             ` Kirill Smelkov
2016-08-08 18:57                               ` [PATCH v3] " Kirill Smelkov
2016-08-08 19:26                               ` [PATCH 1/2] " Junio C Hamano
2016-08-09 11:21                                 ` Kirill Smelkov
2016-08-09 11:25                                   ` [PATCH 1/2 v4] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
2016-08-09 16:52                                   ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
2016-08-09 19:29                                     ` Kirill Smelkov
2016-08-09 19:31                                       ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Kirill Smelkov
2016-08-18 17:52                                         ` Jeff King
2016-09-10 14:57                                           ` Kirill Smelkov
2016-09-10 15:01                                             ` [PATCH 1/2 v8] " Kirill Smelkov
2016-09-13  6:23                                               ` Junio C Hamano
2016-09-13  7:50                                                 ` Kirill Smelkov
2016-09-10 15:05                                             ` [PATCH] t/perf/run: Don't forget to copy config.mak.autogen & friends to build area Kirill Smelkov
2016-09-12 19:12                                               ` Junio C Hamano
2016-09-12 19:17                                                 ` Junio C Hamano
2016-09-12 23:10                                                   ` Junio C Hamano
2016-09-13  6:58                                                     ` Kirill Smelkov
2016-09-12 17:33                                             ` [PATCH 1/2 v5] pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Junio C Hamano
2016-08-09 19:32                                       ` [PATCH 2/2 v7] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
2016-08-18 18:06                                         ` Jeff King
2016-09-10 14:59                                           ` Kirill Smelkov
2016-09-10 15:01                                             ` [PATCH 2/2 v8] " Kirill Smelkov
2016-09-12 19:21                                             ` [PATCH 2/2 v7] " Junio C Hamano
2016-08-09 19:49                                       ` [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Junio C Hamano
2016-07-29  7:47                     ` [PATCH v4 2/2] pack-objects: Teach it to use reachability bitmap index when generating non-stdout pack too Kirill Smelkov
2016-08-08 13:56                       ` Jeff King
2016-08-08 15:40                         ` Kirill Smelkov
2016-08-08 18:08                           ` Junio C Hamano
2016-08-08 18:13                             ` Kirill Smelkov
2016-08-08 18:28                               ` Junio C Hamano
2016-08-08 18:58                                 ` Kirill Smelkov
2016-08-08 18:55                           ` [PATCH v5] pack-objects: teach " Kirill Smelkov
2016-08-08 20:53                             ` Junio C Hamano
2016-08-09 11:21                               ` Kirill Smelkov
2016-08-09 11:26                                 ` [PATCH 2/2 v6] pack-objects: use reachability bitmap index when generating non-stdout pack Kirill Smelkov
