git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 1/2] fetch: run gc --auto after fetching
  2012-12-07 13:53 [PATCH 0/2] make repeated fetch faster on "read only" repos Jeff King
@ 2012-12-07 13:58 ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2012-12-07 13:58 UTC (permalink / raw)
  To: git

We generally try to run "gc --auto" after any commands that
might introduce a large number of new objects. An obvious
place to do so is after running "fetch", which may introduce
new loose objects or packs (depending on the size of the
fetch).

While an active developer repository will probably
eventually trigger a "gc --auto" on another action (e.g.,
git-rebase), there are two good reasons why it is nicer to
do it at fetch time:

  1. Read-only repositories which track an upstream (e.g., a
     continuous integration server which fetches and builds,
     but never makes new commits) will accrue loose objects
     and small packs, but never coalesce them into a more
     efficient larger pack.

  2. Fetching is often already perceived to be slow to the
     user, since they have to wait on the network. It's much
     more pleasant to include a potentially slow auto-gc as
     part of the already-long network fetch than in the
     middle of productive work with git-rebase or similar.

Signed-off-by: Jeff King <peff@peff.net>
---
The original "gc --auto" patch in 2007 suggested adding this to fetch,
but we never did (there was some brief mention that maybe tweaking
fetch.unpacklimit would be relevant, but I think you would eventually
want to gc, whether you are accruing packs or loose objects).

I didn't bother protecting this with an option (as we do for
receive.gc). I don't see much point, as anybody who doesn't want gc on
their client repos will already have set gc.auto to prevent it from
triggering via other commands).

 builtin/fetch.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4b5a898..1ddbf0d 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -959,6 +959,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	struct string_list list = STRING_LIST_INIT_NODUP;
 	struct remote *remote;
 	int result = 0;
+	static const char *argv_gc_auto[] = {
+		"gc", "--auto", NULL,
+	};
 
 	packet_trace_identity("fetch");
 
@@ -1026,5 +1029,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	list.strdup_strings = 1;
 	string_list_clear(&list, 0);
 
+	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
+
 	return result;
 }
-- 
1.8.1.rc0.4.g5948dfd.dirty

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 0/2] optimizing pack access on "read only" fetch repos
@ 2013-01-26 22:40 Jeff King
  2013-01-26 22:40 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Jeff King @ 2013-01-26 22:40 UTC (permalink / raw)
  To: git

This is a repost from here:

  http://thread.gmane.org/gmane.comp.version-control.git/211176

which got no response initially. Basically the issue is that read-only
repos (e.g., a CI server) whose workflow is something like:

  git fetch $some_branch &&
  git checkout -f $some_branch &&
  make test

will never run git-gc, and will accumulate a bunch of small packs and
loose objects, leading to poor performance.

Patch 1 runs "gc --auto" on fetch, which I think is sane to do.

Patch 2 optimizes our pack dir re-scanning for fetch-pack (which, unlike
the rest of git, should expect to be missing lots of objects, since we
are deciding what to fetch).

I think 1 is a no-brainer. If your repo is packed, patch 2 matters less,
but it still seems like a sensible optimization to me.

  [1/2]: fetch: run gc --auto after fetching
  [2/2]: fetch-pack: avoid repeatedly re-scanning pack directory

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] fetch: run gc --auto after fetching
  2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
@ 2013-01-26 22:40 ` Jeff King
  2013-01-27  1:51   ` Jonathan Nieder
       [not found]   ` <87bmopzbqx.fsf@gmail.com>
  2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
  2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
  2 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2013-01-26 22:40 UTC (permalink / raw)
  To: git

We generally try to run "gc --auto" after any commands that
might introduce a large number of new objects. An obvious
place to do so is after running "fetch", which may introduce
new loose objects or packs (depending on the size of the
fetch).

While an active developer repository will probably
eventually trigger a "gc --auto" on another action (e.g.,
git-rebase), there are two good reasons why it is nicer to
do it at fetch time:

  1. Read-only repositories which track an upstream (e.g., a
     continuous integration server which fetches and builds,
     but never makes new commits) will accrue loose objects
     and small packs, but never coalesce them into a more
     efficient larger pack.

  2. Fetching is often already perceived to be slow to the
     user, since they have to wait on the network. It's much
     more pleasant to include a potentially slow auto-gc as
     part of the already-long network fetch than in the
     middle of productive work with git-rebase or similar.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fetch.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4b5a898..1ddbf0d 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -959,6 +959,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	struct string_list list = STRING_LIST_INIT_NODUP;
 	struct remote *remote;
 	int result = 0;
+	static const char *argv_gc_auto[] = {
+		"gc", "--auto", NULL,
+	};
 
 	packet_trace_identity("fetch");
 
@@ -1026,5 +1029,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	list.strdup_strings = 1;
 	string_list_clear(&list, 0);
 
+	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
+
 	return result;
 }
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
  2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
  2013-01-26 22:40 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
@ 2013-01-26 22:40 ` Jeff King
  2013-01-27 10:27   ` Jonathan Nieder
  2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
  2 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2013-01-26 22:40 UTC (permalink / raw)
  To: git

When we look up a sha1 object for reading, we first check
packfiles, and then loose objects. If we still haven't found
it, we re-scan the list of packfiles in `objects/pack`. This
final step ensures that we can co-exist with a simultaneous
repack process which creates a new pack and then prunes the
old object.

This extra re-scan usually does not have a performance
impact for two reasons:

  1. If an object is missing, then typically the re-scan
     will find a new pack, then no more misses will occur.
     Or if it truly is missing, then our next step is
     usually to die().

  2. Re-scanning is cheap enough that we do not even notice.

However, these do not always hold. The assumption in (1) is
that the caller is expecting to find the object. This is
usually the case, but the call to `parse_object` in
`everything_local` does not follow this pattern. It is
looking to see whether we have objects that the remote side
is advertising, not something we expect to have. Therefore
if we are fetching from a remote which has many refs
pointing to objects we do not have, we may end up
re-scanning the pack directory many times.

Even with this extra re-scanning, the impact is often not
noticeable due to (2); we just readdir() the packs directory
and skip any packs that are already loaded. However, if
there are a large number of packs, then even enumerating the
directory directory can be expensive (especially if we do it
repeatedly). Having this many packs is a good sign the user
should run `git gc`, but it would still be nice to avoid
having to scan the directory at all.

Signed-off-by: Jeff King <peff@peff.net>
---
 fetch-pack.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index f0acdf7..6d8926a 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -594,6 +594,9 @@ static int everything_local(struct fetch_pack_args *args,
 	for (ref = *refs; ref; ref = ref->next) {
 		struct object *o;
 
+		if (!has_sha1_file(ref->old_sha1))
+			continue;
+
 		o = parse_object(ref->old_sha1);
 		if (!o)
 			continue;
-- 
1.8.0.2.16.g72e2fc9

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] fetch: run gc --auto after fetching
  2013-01-26 22:40 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
@ 2013-01-27  1:51   ` Jonathan Nieder
       [not found]   ` <87bmopzbqx.fsf@gmail.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Jonathan Nieder @ 2013-01-27  1:51 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Jeff King wrote:

> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -959,6 +959,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  	struct string_list list = STRING_LIST_INIT_NODUP;
>  	struct remote *remote;
>  	int result = 0;
> +	static const char *argv_gc_auto[] = {
> +		"gc", "--auto", NULL,
> +	};
>  
>  	packet_trace_identity("fetch");
>  
> @@ -1026,5 +1029,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  	list.strdup_strings = 1;
>  	string_list_clear(&list, 0);
>  
> +	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
> +
>  	return result;

Good idea, and the execution is obviously correct.  Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
  2013-01-26 22:40 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
  2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
@ 2013-01-27  6:32 ` Junio C Hamano
  2013-01-29  8:06   ` Shawn Pearce
                     ` (2 more replies)
  2 siblings, 3 replies; 22+ messages in thread
From: Junio C Hamano @ 2013-01-27  6:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> This is a repost from here:
>
>   http://thread.gmane.org/gmane.comp.version-control.git/211176
>
> which got no response initially. Basically the issue is that read-only
> repos (e.g., a CI server) whose workflow is something like:
>
>   git fetch $some_branch &&
>   git checkout -f $some_branch &&
>   make test
>
> will never run git-gc, and will accumulate a bunch of small packs and
> loose objects, leading to poor performance.
>
> Patch 1 runs "gc --auto" on fetch, which I think is sane to do.
>
> Patch 2 optimizes our pack dir re-scanning for fetch-pack (which, unlike
> the rest of git, should expect to be missing lots of objects, since we
> are deciding what to fetch).
>
> I think 1 is a no-brainer. If your repo is packed, patch 2 matters less,
> but it still seems like a sensible optimization to me.
>
>   [1/2]: fetch: run gc --auto after fetching
>   [2/2]: fetch-pack: avoid repeatedly re-scanning pack directory
>
> -Peff

Both makes sense to me.

I also wonder if we would be helped by another "repack" mode that
coalesces small packs into a single one with minimum overhead, and
run that often from "gc --auto", so that we do not end up having to
have 50 packfiles.

When we have 2 or more small and young packs, we could:

 - iterate over idx files for these packs to enumerate the objects
   to be packed, replacing read_object_list_from_stdin() step;

 - always choose to copy the data we have in these existing packs,
   instead of doing a full prepare_pack(); and

 - use the order the objects appear in the original packs, bypassing
   compute_write_order().

The procedure cannot be a straight byte-for-byte copy, because some
objects may appear in multiple packs, and extra copies of the same
object have to be excised from the result.  OFS_DELTA offsets need
to be adjusted for objects that appear later in the output and for
objects that were deltified against such an object that recorded its
base with OFS_DELTA format.

But other than such OFS_DELTA adjustments, it feels that such an
"only coalesce multiple packs into one" mode should be fairly quick.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
  2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
@ 2013-01-27 10:27   ` Jonathan Nieder
  2013-01-27 20:09     ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: Jonathan Nieder @ 2013-01-27 10:27 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

Hi,

Jeff King wrote:

> When we look up a sha1 object for reading, we first check
> packfiles, and then loose objects. If we still haven't found
> it, we re-scan the list of packfiles in `objects/pack`. This
> final step ensures that we can co-exist with a simultaneous
> repack process which creates a new pack and then prunes the
> old object.

I like the context above and what follows it, but I think you forgot
to mention what the patch actually does. :)

I guess it is:

	However, in the first scan over refs in fetch-pack.c::everything_local,
	this double-check of packfiles is not necessary since we are only
	trying to get a rough estimate of the last time we fetched from this
	remote repository in order to find good candidate common commits ---
	a missed object would only result in a slightly slower fetch.

	Avoid that slow second scan in the common case by guarding the object
	lookup with has_sha1_file().

Sounds like it would not affect most fetches except by making them
a lot faster in the many-refs case, so for what it's worth I like it.

I had not read this codepath before.  I'm left with a few questions:

 * Why is 49bb805e ("Do not ask for objects known to be complete",
   2005-10-19) trying to do?  Are we hurting that in any way?

   For the sake of an example, suppose in my stalled project I
   maintain 20 topic branches against an unmoving mainline I do not
   advertise and you regularly fetch from me.  The cutoff is the
   *newest* commit date of any of my topic branches you already have.
   By declaring you have that topic branch you avoid a complicated
   negotiation to discover that we both have the mainline.  Is that
   the goal?

 * Is has_sha1_file() generally succeptible to the race against repack
   you mentioned?  How is that normally dealt with?

 * Can a slow operation get confused if an object is incorporated into
   a pack and then expelled again by two repacks in sequence?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
  2013-01-27 10:27   ` Jonathan Nieder
@ 2013-01-27 20:09     ` Junio C Hamano
  2013-01-27 23:20       ` Jonathan Nieder
  0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2013-01-27 20:09 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, git

Jonathan Nieder <jrnieder@gmail.com> writes:

> Jeff King wrote:
>
>> When we look up a sha1 object for reading, we first check
>> packfiles, and then loose objects. If we still haven't found
>> it, we re-scan the list of packfiles in `objects/pack`. This
>> final step ensures that we can co-exist with a simultaneous
>> repack process which creates a new pack and then prunes the
>> old object.
>
> I like the context above and what follows it, but I think you forgot
> to mention what the patch actually does. :)
>
> I guess it is:
>
> 	However, in the first scan over refs in fetch-pack.c::everything_local,
> 	this double-check of packfiles is not necessary since we are only
> 	trying to get a rough estimate of the last time we fetched from this
> 	remote repository in order to find good candidate common commits ---
> 	a missed object would only result in a slightly slower fetch.

It is not about a rough estimate nor common commits, though.  The
"everything local" check in question is interested in only one
thing: are we _clearly_ up to date without fetching anything from
them?

Loosening the check may miss the rare case where we race against a
simultaneous repack and will cause us to go to the network when we
do not have to, and it becomes a trade off between the common unracy
case going faster by allowing the "Are we clearly up to date" check
to cheat, at the expense of rare racy cases suffering unnecessary
object transfer overhead.

> 	Avoid that slow second scan in the common case by guarding the object
> 	lookup with has_sha1_file().

This conclusion is correct.

> I had not read this codepath before.  I'm left with a few questions:
>
>  * Why is 49bb805e ("Do not ask for objects known to be complete",
>    2005-10-19) trying to do?  Are we hurting that in any way?

An earlier fetch may have acquired all the necessary objects but may
not have updated our refs for some reason (e.g. fast-forward check
may have fired).  In such a case, we may already have a history that
is good (i.e. not missing paths down to the common history) in our
repository that is not connected to any of our refs, and we can
update our refs (or write to FETCH_HEAD) without asking the remote
end to do any common ancestor computation or object transfer.

That was the primary thing the patch wanted to do.

As a side-effect, we know more objects than just the objects at the
tips of our refs are complete and that may help the later common
history discovery step, but obviously we do not want to dig the
history down to root.  The cutoff value is merely a heuristics
chosen without any deep thought.

>  * Is has_sha1_file() generally succeptible to the race against repack
>    you mentioned?  How is that normally dealt with?

By failing to find, so that the user will restart.  When the caller
really wants to use the object, parse_objects() => read_sha1_file()
=> read_object() is used and we will see the retry.

>  * Can a slow operation get confused if an object is incorporated into
>    a pack and then expelled again by two repacks in sequence?

If it checks "the object should be there" first, wait for a long
time, and then tries to find that object's data, the later access
will go to the parse_objects() callpath and I think it should do the
right thing.  If that slow opearation stops inside read_object(), it
could find it unable to map the loose object file and then unable to
find it in the pack, either.  Is that what you are worried about?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory
  2013-01-27 20:09     ` Junio C Hamano
@ 2013-01-27 23:20       ` Jonathan Nieder
  0 siblings, 0 replies; 22+ messages in thread
From: Jonathan Nieder @ 2013-01-27 23:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

Junio C Hamano wrote:

> It is not about a rough estimate nor common commits, though.  The
> "everything local" check in question is interested in only one
> thing: are we _clearly_ up to date without fetching anything from
> them?
[...]
> Jonathan Nieder <jrnieder@gmail.com> writes:

>>  * Why is 49bb805e ("Do not ask for objects known to be complete",
>>    2005-10-19) trying to do?  Are we hurting that in any way?
>
> An earlier fetch may have acquired all the necessary objects but may
> not have updated our refs for some reason (e.g. fast-forward check
> may have fired).  In such a case, we may already have a history that
> is good (i.e. not missing paths down to the common history) in our
> repository that is not connected to any of our refs, and we can
> update our refs (or write to FETCH_HEAD) without asking the remote
> end to do any common ancestor computation or object transfer.
>
> That was the primary thing the patch wanted to do.

Interesting.

After I fetch objects for branches a, b, and c which all have different
commit times in a fetch that fails due to the fast-forward check, I
have the following objects:

	a	commit date = 25 January 2013
	b	commit date = 26 January 2013
	c	commit date = 27 January 2013

When I try to fetch again (forcibly this time), git notices that the
objects are available locally and sets "cutoff" to 27 January 2013 as
a hint about the last time I fetched.

mark_recent_complete_commits() is called, and since these objects are
not reachable from any local refs, none are visited in the walk, and
49bb805e does not affect the outcome of the fetch positively or
negatively.

On the other hand, if I fetched a, b, and c to local branches and then
built on top of them, 49bb805e ensures that a second fetch of the same
refs will know right away not to request objects for c.  So it brings
some of the benefits of 2759cbc7 (git-fetch-pack: avoid unnecessary
zero packing, 2005-10-18) when the receiving side has either (A)
renamed refs or (B) built new history on top of them.

Correct?

It is only in case (B) that the cutoff matters.  If we miscalculate
the cutoff, that means either (i) cutoff == 0, and we would lose the
benefit of the walk that finds complete commits, or (ii) cutoff is a
little earlier, and the walk to find complete commits would take a
little longer.  Neither effects the result of the fetch, and neither
is likely to be a significant enough performance difference to offset
the benefit of Jeff's patch.

Thanks for explaining.
Jonathan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
@ 2013-01-29  8:06   ` Shawn Pearce
  2013-01-29  8:29   ` Jeff King
  2013-01-29 11:01   ` Duy Nguyen
  2 siblings, 0 replies; 22+ messages in thread
From: Shawn Pearce @ 2013-01-29  8:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Sat, Jan 26, 2013 at 10:32 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> This is a repost from here:
>>
>>   http://thread.gmane.org/gmane.comp.version-control.git/211176
>>
>> which got no response initially. Basically the issue is that read-only
>> repos (e.g., a CI server) whose workflow is something like:
>>
>>   git fetch $some_branch &&
>>   git checkout -f $some_branch &&
>>   make test
>>
>> will never run git-gc, and will accumulate a bunch of small packs and
>> loose objects, leading to poor performance.
...
> I also wonder if we would be helped by another "repack" mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from "gc --auto", so that we do not end up having to
> have 50 packfiles.

Yes. This does help....

> When we have 2 or more small and young packs, we could:
>
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing read_object_list_from_stdin() step;
>
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
>
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

Hmm, sounds familiar. Seems like its what we do in JGit for Android. :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
  2013-01-29  8:06   ` Shawn Pearce
@ 2013-01-29  8:29   ` Jeff King
  2013-01-29 15:25     ` Martin Fick
  2013-01-29 15:58     ` Junio C Hamano
  2013-01-29 11:01   ` Duy Nguyen
  2 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2013-01-29  8:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git

On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:

> Both makes sense to me.
> 
> I also wonder if we would be helped by another "repack" mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from "gc --auto", so that we do not end up having to
> have 50 packfiles.
> 
> When we have 2 or more small and young packs, we could:
> 
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing read_object_list_from_stdin() step;
> 
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
> 
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be
concatenating packs without trying to do delta compression between the
objects which are ending up in the same pack. So it would save us from
having to do (up to) 50 binary searches to find an object in a pack, but
would not actually save us much space.

I would be interested to see the timing on how quick it is compared to a
real repack, as the I/O that happens during a repack is non-trivial
(although if you are leaving aside the big "main" pack, then it is
probably not bad).

But how do these somewhat mediocre concatenated packs get turned into
real packs? Pack-objects does not consider deltas between objects in the
same pack. And when would you decide to make a real pack? How do you
know you have 50 young and small packs, and not 50 mediocre coalesced
packs?

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
  2013-01-29  8:06   ` Shawn Pearce
  2013-01-29  8:29   ` Jeff King
@ 2013-01-29 11:01   ` Duy Nguyen
  2 siblings, 0 replies; 22+ messages in thread
From: Duy Nguyen @ 2013-01-29 11:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Sun, Jan 27, 2013 at 1:32 PM, Junio C Hamano <gitster@pobox.com> wrote:
> I also wonder if we would be helped by another "repack" mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from "gc --auto", so that we do not end up having to
> have 50 packfiles.
>
> When we have 2 or more small and young packs, we could:
>
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing read_object_list_from_stdin() step;
>
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
>
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

Isn't it easier and cheaper to create the "master index", something
like bup does?
-- 
Duy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-29  8:29   ` Jeff King
@ 2013-01-29 15:25     ` Martin Fick
  2013-01-29 15:58     ` Junio C Hamano
  1 sibling, 0 replies; 22+ messages in thread
From: Martin Fick @ 2013-01-29 15:25 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Shawn O. Pearce, git



Jeff King <peff@peff.net> wrote:

>On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:
>
>> Both makes sense to me.
>> 
>> I also wonder if we would be helped by another "repack" mode that
>> coalesces small packs into a single one with minimum overhead, and
>> run that often from "gc --auto", so that we do not end up having to
>> have 50 packfiles.
>> 
>> When we have 2 or more small and young packs, we could:
>> 
>>  - iterate over idx files for these packs to enumerate the objects
>>    to be packed, replacing read_object_list_from_stdin() step;
>> 
>>  - always choose to copy the data we have in these existing packs,
>>    instead of doing a full prepare_pack(); and
>> 
>>  - use the order the objects appear in the original packs, bypassing
>>    compute_write_order().
>
>I'm not sure. If I understand you correctly, it would basically just be
>concatenating packs without trying to do delta compression between the
>objects which are ending up in the same pack. So it would save us from
>having to do (up to) 50 binary searches to find an object in a pack,
>but
>would not actually save us much space.
>
>I would be interested to see the timing on how quick it is compared to
>a
>real repack, as the I/O that happens during a repack is non-trivial
>(although if you are leaving aside the big "main" pack, then it is
>probably not bad).
>
>But how do these somewhat mediocre concatenated packs get turned into
>real packs? Pack-objects does not consider deltas between objects in
>the
>same pack. And when would you decide to make a real pack? How do you
>know you have 50 young and small packs, and not 50 mediocre coalesced
>packs?


If we are reconsidering repacking strategies, I would like to propose an approach that might be a more general improvement to repacking which would help in more situations. 

You could roll together any packs which are close in size, say within 50% of each other.  With this strategy you will end up with files which are spread out by size exponentially.  I implementated this strategy on top of the current gc script using keep files, it works fairly well:

https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh

This saves some time, but mostly it saves I/O when repacking regularly.  I suspect that if this strategy were used in core git that further optimizations could be made to also reduce the repack time, but I don't know enough about repacking to know?  We run it nightly on our servers, both write and read only mirrors. We us are a ratio of 5 currently to drastically reduce large repack file rollovers,

-Martin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-29  8:29   ` Jeff King
  2013-01-29 15:25     ` Martin Fick
@ 2013-01-29 15:58     ` Junio C Hamano
  2013-01-29 21:19       ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2013-01-29 15:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, git

Jeff King <peff@peff.net> writes:

>> I also wonder if we would be helped by another "repack" mode that
>> coalesces small packs into a single one with minimum overhead, and
>> run that often from "gc --auto", so that we do not end up having to
>> have 50 packfiles.
>> ...
>
> I'm not sure. If I understand you correctly, it would basically just be
> concatenating packs without trying to do delta compression between the
> objects which are ending up in the same pack. So it would save us from
> having to do (up to) 50 binary searches to find an object in a pack, but
> would not actually save us much space.

The point is not about space.  Disk is cheap, and it is not making
it any worse than what happens to your target audience, that is a
fetch-only repository with only "gc --auto" in it, where nobody
passes "-f" to "repack" to cause recomputation of delta.

What I was trying to seek was a way to reduce the runtime penalty we
pay every time we run git in such a repository.

 - Object look-up cost will become log2(50*n) from 50*log2(n), which
   is about 50/log2(50) improvement;

 - System resource cost we incur by having to keep 50 file
   descriptors open and maintaining 50 mmap windows will reduce by
   50 fold.

 - Anything else I missed?

> I would be interested to see the timing on how quick it is compared to a
> real repack,...

Yes, that is what I meant by "wonder if we would be helped by" ;-)

> But how do these somewhat mediocre concatenated packs get turned into
> real packs?

How do they get processed in a fetch-only repositories that
sometimes run "gc --auto" today?  By runninng "repack -a -d -f"
occasionally, perhaps?

At some point, you would need to run a repack that involves a real
object-graph traversal that feeds you the paths for objects to
obtain a reasonably compressed pack.  We can inspect existing packs
and guess a rough traversal order for commits, but for trees and
blobs, you cannot unify existing delta chains from multiple packs
effectively with data in the pack alone.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-29 15:58     ` Junio C Hamano
@ 2013-01-29 21:19       ` Jeff King
  2013-01-29 22:26         ` Junio C Hamano
  2013-01-31 16:47         ` Shawn Pearce
  0 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2013-01-29 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git

On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:

> The point is not about space.  Disk is cheap, and it is not making
> it any worse than what happens to your target audience, that is a
> fetch-only repository with only "gc --auto" in it, where nobody
> passes "-f" to "repack" to cause recomputation of delta.
> 
> What I was trying to seek was a way to reduce the runtime penalty we
> pay every time we run git in such a repository.
> 
>  - Object look-up cost will become log2(50*n) from 50*log2(n), which
>    is about 50/log2(50) improvement;

Yes and no. Our heuristic is to look at the last-used pack for an
object. So assuming we have locality of requests, we should quite often
get "lucky" and find the object in the first log2 search. Even if we
don't assume locality, a situation with one large pack and a few small
packs will have the large one as "last used" more often than the others,
and it will also have the looked-for object more often than the others

So I can see how it is something we could potentially optimize, but I
could also see it being surprisingly not a big deal. I'd be very
interested to see real measurements, even of something as simple as a
"master index" which can reference multiple packfiles.

>  - System resource cost we incur by having to keep 50 file
>    descriptors open and maintaining 50 mmap windows will reduce by
>    50 fold.

I wonder how measurable that is (and if it matters on Linux versus less
efficient platforms).

> > I would be interested to see the timing on how quick it is compared to a
> > real repack,...
> 
> Yes, that is what I meant by "wonder if we would be helped by" ;-)

There is only one way to find out... :)

Maybe I am blessed with nice machines, but I have mostly found the
repack process not to be that big a deal these days (especially with
threaded delta compression).

> > But how do these somewhat mediocre concatenated packs get turned into
> > real packs?
> 
> How do they get processed in a fetch-only repositories that
> sometimes run "gc --auto" today?  By runninng "repack -a -d -f"
> occasionally, perhaps?

Do we run "repack -adf" regularly? The usual "git gc" procedure will not
use "-f", and without that, we will not even consider making deltas
between objects that were formerly in different packs (but now are in
the same pack).

So you are avoiding doing medium-effort packs ("repack -ad") in favor of
doing potentially quick packs, but occasionally doing a big-effort pack
("repack -adf"). It may be reasonable advice to "repack -adf"
occasionally, but I suspect most people are not doing it regularly (if
only because "git gc" does not do it by default).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-29 21:19       ` Jeff King
@ 2013-01-29 22:26         ` Junio C Hamano
  2013-01-31 16:47         ` Shawn Pearce
  1 sibling, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2013-01-29 22:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, git

Jeff King <peff@peff.net> writes:

>> > But how do these somewhat mediocre concatenated packs get turned into
>> > real packs?
>> 
>> How do they get processed in a fetch-only repositories that
>> sometimes run "gc --auto" today?  By runninng "repack -a -d -f"
>> occasionally, perhaps?
>
> Do we run "repack -adf" regularly? The usual "git gc" procedure will not
> use "-f", and without that, we will not even consider making deltas
> between objects that were formerly in different packs (but now are in
> the same pack).

Correct.  It is not a new problem, and while I think it would need
some solution, the "coalesce 50 small young packs into one" is not
an attempt to address it.

> So you are avoiding doing medium-effort packs ("repack -ad") in favor of
> doing potentially quick packs, but occasionally doing a big-effort pack
> ("repack -adf").

So I think "but occasionally" part is not correct.  In either way,
the packs use suboptimal delta, and you have to eventually pack with
"-f", whether you coalesce 50 packs into 1 often or keep paying the
cost of having 50 packs longer.

The trade-off is purely between "one potentially quick coalescing
per fetch in fetch-only repository" vs "any use of the data in the
fetch-only repository (what do they do?  build?  serving gitweb
locally?) has to deal with 25 packs on average, and we still need to
pay medium repack cost every 50 fetches".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-29 21:19       ` Jeff King
  2013-01-29 22:26         ` Junio C Hamano
@ 2013-01-31 16:47         ` Shawn Pearce
  2013-02-01  9:14           ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Shawn Pearce @ 2013-01-31 16:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Tue, Jan 29, 2013 at 1:19 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:
>
>> The point is not about space.  Disk is cheap, and it is not making
>> it any worse than what happens to your target audience, that is a
>> fetch-only repository with only "gc --auto" in it, where nobody
>> passes "-f" to "repack" to cause recomputation of delta.
>>
>> What I was trying to seek was a way to reduce the runtime penalty we
>> pay every time we run git in such a repository.
>>
>>  - Object look-up cost will become log2(50*n) from 50*log2(n), which
>>    is about 50/log2(50) improvement;
>
> Yes and no. Our heuristic is to look at the last-used pack for an
> object. So assuming we have locality of requests, we should quite often
> get "lucky" and find the object in the first log2 search. Even if we
> don't assume locality, a situation with one large pack and a few small
> packs will have the large one as "last used" more often than the others,
> and it will also have the looked-for object more often than the others

Opening all of those files does impact performance. It depends on how
slow your open(2) syscall is. I know on Mac OS X that its not the
fastest function we get from the C library. Performing ~40 opens to
look through the most recent pack files and finally find the "real"
pack that contains that tag you asked `git show` for isn't that quick.

Some of us also use Git on filesystems that are network based, and
slow compared to local disk Linux ext2/3/4 with gobs of free RAM.

> So I can see how it is something we could potentially optimize, but I
> could also see it being surprisingly not a big deal. I'd be very
> interested to see real measurements, even of something as simple as a
> "master index" which can reference multiple packfiles.

I actually tried this many many years ago. There are threads in the
archive about it. Its slower. We ruled it out.

>>  - System resource cost we incur by having to keep 50 file
>>    descriptors open and maintaining 50 mmap windows will reduce by
>>    50 fold.
>
> I wonder how measurable that is (and if it matters on Linux versus less
> efficient platforms).

It does matter. We know it has a negative impact on JGit even on Linux
for example. You don't want 300 packs in a repository. 50 might be
tolerable. 300 is not.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-01-31 16:47         ` Shawn Pearce
@ 2013-02-01  9:14           ` Jeff King
  2013-02-02 10:07             ` Shawn Pearce
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2013-02-01  9:14 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, git

On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:

> >>  - System resource cost we incur by having to keep 50 file
> >>    descriptors open and maintaining 50 mmap windows will reduce by
> >>    50 fold.
> >
> > I wonder how measurable that is (and if it matters on Linux versus less
> > efficient platforms).
> 
> It does matter. We know it has a negative impact on JGit even on Linux
> for example. You don't want 300 packs in a repository. 50 might be
> tolerable. 300 is not.

I'd love to see numbers if you have them. It's not that I don't believe
it is slower, but knowing _how much_ is important when thinking about
what kind of performance increase we are looking to get (which in turn
impacts how much effort to put into the repacking).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] optimizing pack access on "read only" fetch repos
  2013-02-01  9:14           ` Jeff King
@ 2013-02-02 10:07             ` Shawn Pearce
  0 siblings, 0 replies; 22+ messages in thread
From: Shawn Pearce @ 2013-02-02 10:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Fri, Feb 1, 2013 at 1:14 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:
>
>> >>  - System resource cost we incur by having to keep 50 file
>> >>    descriptors open and maintaining 50 mmap windows will reduce by
>> >>    50 fold.
>> >
>> > I wonder how measurable that is (and if it matters on Linux versus less
>> > efficient platforms).
>>
>> It does matter. We know it has a negative impact on JGit even on Linux
>> for example. You don't want 300 packs in a repository. 50 might be
>> tolerable. 300 is not.
>
> I'd love to see numbers if you have them. It's not that I don't believe
> it is slower, but knowing _how much_ is important when thinking about
> what kind of performance increase we are looking to get (which in turn
> impacts how much effort to put into the repacking).

Never done a formal experiment. Just working from memory where 4 years
and 3 desks ago I got called because one of our Git servers was
struggling to keep up with user git fetch commands. Turns out the
repository had O(200) pack files. git gc that normally took only about
5 minutes took a hellva lot longer, like an hour or more.

The problem happened because the server was saving every push to a
pack and never exploding incoming packs to loose objects. This meant
the thin packs from the wire got delta bases appended to them.
pack-objects was looking at O(50) different alternatives for most
objects when trying to decide which one it should copy into the output
pack... for either a fetch/clone client, or the gc I was trying to run
to fix the repository.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git gc --auto aquires *.lock files that make a subsequent git-fetch error out
       [not found]   ` <87bmopzbqx.fsf@gmail.com>
@ 2017-07-12 20:00     ` Jeff King
  2017-07-12 20:30       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2017-07-12 20:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List

On Wed, Jul 12, 2017 at 09:38:46PM +0200, Ævar Arnfjörð Bjarmason wrote:

> In 131b8fcbfb ("fetch: run gc --auto after fetching", 2013-01-26) first
> released with v1.8.2 Jeff changed git-fetch to run "git gc --auto"
> afterwards.
> 
> This means that if you run two git fetches in a row the second one may
> fail because it can't acquire the *.lock files on the remote branches you
> have & which the next git-fetch needs to update.

Is it really "in a row" that's a problem? The second fetch should not
begin until the first one is done, including until its auto-gc exits.
And even with background gc, we do the ref-locking operations first, due
to 62aad1849 (gc --auto: do not lock refs in the background,
2014-05-25).

> I happen to run into this on a git.git which has a lot of remotes (most
> people on-list whose remotes I know about) and fetch them in parallel:
> 
>     $ git config alias.pfetch
>     !parallel 'git fetch {}' ::: $(git remote)

Ah, so it's not in a row. It's parallel. Then yes, you may run into
problems with the gc locks conflicting with real operations. This isn't
really unique to fetch. Any simultaneous operation can run into problems
(e.g., on a busy server repo you may see conflicts between pack-refs and
regular pushes).

> And so would 'git fetch --all':
> 
>     $ GIT_TRACE=1 git fetch --all 2>&1|grep --line-buffered built-in|grep -v rev-list
>     19:31:26.273577 git.c:328               trace: built-in: git 'fetch' '--all'
>     19:31:26.278869 git.c:328               trace: built-in: git 'fetch' '--append' 'origin'
>     19:31:27.993312 git.c:328               trace: built-in: git 'gc' '--auto'
>     19:31:27.995855 git.c:328               trace: built-in: git 'fetch' '--append' 'avar'
>     19:31:29.656925 git.c:328               trace: built-in: git 'gc' '--auto'
> 
> I think those two cases are bugs (but ones which I don't have the
> inclination to chase myself beyond sending this E-Mail). We should be
> running the 'git gc --auto' at the very end of the entire program, not
> after fetching every single remote.
> 
> Passing some env variable (similar to the config we pass via the env) to
> subprograms to make them avoid "git gc --auto" so the main process can
> do it would probably be the most simple solution.

Yes, I agree that's poor. Ideally there would be a command-line option
to tell the sub-fetches not to run auto-gc. It could be done with:

  git -c gc.auto=0 fetch --append ...

Or we could even take the "--append" as a hint not to run auto-gc.

> The more general case (such as with my parallel invocation) is harder to
> solve.

Yes, I don't think it can solved. The most general case is two totally
unrelated processes which know nothing about each other.

> Maybe "git gc --auto" should have a heuristic so it checks whether
> there's been recent activity on the repo, and waits until there's been
> say 60 seconds of no activity, or alternatively if it's waited 600
> seconds and hasn't run gc yet.

That sounds complicated.

> Ideally a "real" invocation like git-fetch would have a way to simply
> steal any *.lock a background "git gc --auto" creates, aborting the gc
> but allowing the "real" invocation to proceed. But that sounds even
> trickier to implement, and might without an extra heuristic on top
> postpone gc indefinitely.

The locks are generally due to ref-packing and reflog expiration.  I
think in the long run, it would be nice to move to a ref store that
didn't need packing, and that could do reflog expiration more
atomically.

I think the way "reflog expire" is done holds the locks for a lot longer
than is strictly necessary, too (it actually computes reachability for
--expire-unreachable on the fly while holding some locks).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git gc --auto aquires *.lock files that make a subsequent git-fetch error out
  2017-07-12 20:00     ` git gc --auto aquires *.lock files that make a subsequent git-fetch error out Jeff King
@ 2017-07-12 20:30       ` Ævar Arnfjörð Bjarmason
  2017-07-12 20:43         ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-07-12 20:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List


On Wed, Jul 12 2017, Jeff King jotted:

> On Wed, Jul 12, 2017 at 09:38:46PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> In 131b8fcbfb ("fetch: run gc --auto after fetching", 2013-01-26) first
>> released with v1.8.2 Jeff changed git-fetch to run "git gc --auto"
>> afterwards.
>>
>> This means that if you run two git fetches in a row the second one may
>> fail because it can't acquire the *.lock files on the remote branches you
>> have & which the next git-fetch needs to update.
>
> Is it really "in a row" that's a problem? The second fetch should not
> begin until the first one is done, including until its auto-gc exits.
> And even with background gc, we do the ref-locking operations first, due
> to 62aad1849 (gc --auto: do not lock refs in the background,
> 2014-05-25).
>
>> I happen to run into this on a git.git which has a lot of remotes (most
>> people on-list whose remotes I know about) and fetch them in parallel:
>>
>>     $ git config alias.pfetch
>>     !parallel 'git fetch {}' ::: $(git remote)
>
> Ah, so it's not in a row. It's parallel. Then yes, you may run into
> problems with the gc locks conflicting with real operations. This isn't
> really unique to fetch. Any simultaneous operation can run into problems
> (e.g., on a busy server repo you may see conflicts between pack-refs and
> regular pushes).

This is what I thought at first, and I've only encountered the issue in
this parallel mode (mainly because it's tedious to reproduce). But I
think the traces below show that it would happen with "git fetch --all"
& "git remote update" as well, so the parallel invocations didn't
matter.

I.e. I'd just update my first remote, then git-gc would start in the
background and lock refs for my other remotes, which I'd then fail to
update.

>> And so would 'git fetch --all':
>>
>>     $ GIT_TRACE=1 git fetch --all 2>&1|grep --line-buffered built-in|grep -v rev-list
>>     19:31:26.273577 git.c:328               trace: built-in: git 'fetch' '--all'
>>     19:31:26.278869 git.c:328               trace: built-in: git 'fetch' '--append' 'origin'
>>     19:31:27.993312 git.c:328               trace: built-in: git 'gc' '--auto'
>>     19:31:27.995855 git.c:328               trace: built-in: git 'fetch' '--append' 'avar'
>>     19:31:29.656925 git.c:328               trace: built-in: git 'gc' '--auto'
>>
>> I think those two cases are bugs (but ones which I don't have the
>> inclination to chase myself beyond sending this E-Mail). We should be
>> running the 'git gc --auto' at the very end of the entire program, not
>> after fetching every single remote.
>>
>> Passing some env variable (similar to the config we pass via the env) to
>> subprograms to make them avoid "git gc --auto" so the main process can
>> do it would probably be the most simple solution.
>
> Yes, I agree that's poor. Ideally there would be a command-line option
> to tell the sub-fetches not to run auto-gc. It could be done with:
>
>   git -c gc.auto=0 fetch --append ...
>
> Or we could even take the "--append" as a hint not to run auto-gc.
>
>> The more general case (such as with my parallel invocation) is harder to
>> solve.
>
> Yes, I don't think it can solved. The most general case is two totally
> unrelated processes which know nothing about each other.
>
>> Maybe "git gc --auto" should have a heuristic so it checks whether
>> there's been recent activity on the repo, and waits until there's been
>> say 60 seconds of no activity, or alternatively if it's waited 600
>> seconds and hasn't run gc yet.
>
> That sounds complicated.
>
>> Ideally a "real" invocation like git-fetch would have a way to simply
>> steal any *.lock a background "git gc --auto" creates, aborting the gc
>> but allowing the "real" invocation to proceed. But that sounds even
>> trickier to implement, and might without an extra heuristic on top
>> postpone gc indefinitely.
>
> The locks are generally due to ref-packing and reflog expiration.  I
> think in the long run, it would be nice to move to a ref store that
> didn't need packing, and that could do reflog expiration more
> atomically.
>
> I think the way "reflog expire" is done holds the locks for a lot longer
> than is strictly necessary, too (it actually computes reachability for
> --expire-unreachable on the fly while holding some locks).
>
> -Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: git gc --auto aquires *.lock files that make a subsequent git-fetch error out
  2017-07-12 20:30       ` Ævar Arnfjörð Bjarmason
@ 2017-07-12 20:43         ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2017-07-12 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List

On Wed, Jul 12, 2017 at 10:30:25PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Is it really "in a row" that's a problem? The second fetch should not
> > begin until the first one is done, including until its auto-gc exits.
> > And even with background gc, we do the ref-locking operations first, due
> > to 62aad1849 (gc --auto: do not lock refs in the background,
> > 2014-05-25).
> >
> >> I happen to run into this on a git.git which has a lot of remotes (most
> >> people on-list whose remotes I know about) and fetch them in parallel:
> >>
> >>     $ git config alias.pfetch
> >>     !parallel 'git fetch {}' ::: $(git remote)
> >
> > Ah, so it's not in a row. It's parallel. Then yes, you may run into
> > problems with the gc locks conflicting with real operations. This isn't
> > really unique to fetch. Any simultaneous operation can run into problems
> > (e.g., on a busy server repo you may see conflicts between pack-refs and
> > regular pushes).
> 
> This is what I thought at first, and I've only encountered the issue in
> this parallel mode (mainly because it's tedious to reproduce). But I
> think the traces below show that it would happen with "git fetch --all"
> & "git remote update" as well, so the parallel invocations didn't
> matter.
> 
> I.e. I'd just update my first remote, then git-gc would start in the
> background and lock refs for my other remotes, which I'd then fail to
> update.

No, it should be OK because of the commit I mentioned at the top of the
quoted section. Each one runs sequentially.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-07-12 20:43 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-26 22:40 [PATCH 0/2] optimizing pack access on "read only" fetch repos Jeff King
2013-01-26 22:40 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King
2013-01-27  1:51   ` Jonathan Nieder
     [not found]   ` <87bmopzbqx.fsf@gmail.com>
2017-07-12 20:00     ` git gc --auto aquires *.lock files that make a subsequent git-fetch error out Jeff King
2017-07-12 20:30       ` Ævar Arnfjörð Bjarmason
2017-07-12 20:43         ` Jeff King
2013-01-26 22:40 ` [PATCH 2/2] fetch-pack: avoid repeatedly re-scanning pack directory Jeff King
2013-01-27 10:27   ` Jonathan Nieder
2013-01-27 20:09     ` Junio C Hamano
2013-01-27 23:20       ` Jonathan Nieder
2013-01-27  6:32 ` [PATCH 0/2] optimizing pack access on "read only" fetch repos Junio C Hamano
2013-01-29  8:06   ` Shawn Pearce
2013-01-29  8:29   ` Jeff King
2013-01-29 15:25     ` Martin Fick
2013-01-29 15:58     ` Junio C Hamano
2013-01-29 21:19       ` Jeff King
2013-01-29 22:26         ` Junio C Hamano
2013-01-31 16:47         ` Shawn Pearce
2013-02-01  9:14           ` Jeff King
2013-02-02 10:07             ` Shawn Pearce
2013-01-29 11:01   ` Duy Nguyen
  -- strict thread matches above, loose matches on Subject: below --
2012-12-07 13:53 [PATCH 0/2] make repeated fetch faster on "read only" repos Jeff King
2012-12-07 13:58 ` [PATCH 1/2] fetch: run gc --auto after fetching Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).