git@vger.kernel.org mailing list mirror (one of many)
* Repacking a repository uses up all available disk space
@ 2016-06-12 21:25 Konstantin Ryabitsev
  2016-06-12 21:38 ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Konstantin Ryabitsev @ 2016-06-12 21:25 UTC (permalink / raw)
  To: git


Hello:

I have a problematic repository that:

- Takes up 9GB on disk
- Passes 'git fsck --full' with no errors
- When cloned with --mirror, takes up 38M on the target system
- When attempting to repack, creates millions of files and eventually
  eats up all available disk space

Repacking the result of 'git clone --mirror' shows no problem, so it's
got to be something really weird with that particular instance of the
repository.

If anyone is interested in poking at this particular problem to figure
out what causes the repack process to eat up all available disk space,
you can find the tarball of the problematic repository here:

http://mricon.com/misc/src.git.tar.xz (warning: 6.6GB)

You can clone the non-problematic version of this repository from
git://codeaurora.org/quic/chrome4sdp/breakpad/breakpad/src.git

Best,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec



* Re: Repacking a repository uses up all available disk space
  2016-06-12 21:25 Repacking a repository uses up all available disk space Konstantin Ryabitsev
@ 2016-06-12 21:38 ` Jeff King
  2016-06-12 21:54   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2016-06-12 21:38 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: git

On Sun, Jun 12, 2016 at 05:25:14PM -0400, Konstantin Ryabitsev wrote:

> Hello:
> 
> I have a problematic repository that:
> 
> - Takes up 9GB on disk
> - Passes 'git fsck --full' with no errors
> - When cloned with --mirror, takes up 38M on the target system

Cloning will only copy the objects that are reachable from the refs. So
presumably the other 8.9GB is either reachable from reflogs, or not
reachable at all (due to rewinding history or deleting branches).

> - When attempting to repack, creates millions of files and eventually
>   eats up all available disk space

That means these objects fall into the unreachable category. Git will
prune unreachable loose objects after a grace period based on the
filesystem mtime of the objects; the default is 2 weeks.

For unreachable packed objects, their mtime is jumbled in with the rest
of the objects in the packfile.  So Git's strategy is to "eject" such
objects from the packfiles into individual loose objects, and let them
"age out" of the grace period individually.

Generally this works just fine, but there are corner cases where you
might have a very large number of such objects, and the loose storage is
much more expensive than the packed (e.g., because each object is stored
individually, not as a delta).

It sounds like this is the case you're running into.

The solution is to lower the grace period time, with something like:

  git gc --prune=5.minutes.ago

or even:

  git gc --prune=now

That will prune the unreachable objects immediately (and the packfile
ejector is smart enough to skip ejecting any file that would just get
deleted immediately anyway).
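
As a rough way to check whether a repository is in this situation, one can
compare the loose and packed object counts before and after a repack. The
following is only a sketch, assuming a POSIX shell with awk available; the
helper name `repo_object_summary` is made up for illustration and is not
part of Git:

```shell
#!/bin/sh
# Hypothetical helper: print loose vs. packed object counts for a repository.
# It only parses the "count" and "in-pack" fields of `git count-objects -v`.
repo_object_summary() {
	# $1: path to the repository to inspect
	git -C "$1" count-objects -v |
	awk -F': ' '
		$1 == "count"   { loose = $2 }   # number of loose objects
		$1 == "in-pack" { packed = $2 }  # number of packed objects
		END { printf "loose=%d packed=%d\n", loose, packed }
	'
}
```

A loose count that jumps into the millions after a repack suggests that
unreachable packed objects have been ejected as described above.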

-Peff


* Re: Repacking a repository uses up all available disk space
  2016-06-12 21:38 ` Jeff King
@ 2016-06-12 21:54   ` Konstantin Ryabitsev
  2016-06-12 22:13     ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Konstantin Ryabitsev @ 2016-06-12 21:54 UTC (permalink / raw)
  To: Jeff King; +Cc: git


On Sun, Jun 12, 2016 at 05:38:04PM -0400, Jeff King wrote:
> > - When attempting to repack, creates millions of files and eventually
> >   eats up all available disk space
> 
> That means these objects fall into the unreachable category. Git will
> prune unreachable loose objects after a grace period based on the
> filesystem mtime of the objects; the default is 2 weeks.
> 
> For unreachable packed objects, their mtime is jumbled in with the rest
> of the objects in the packfile.  So Git's strategy is to "eject" such
> objects from the packfiles into individual loose objects, and let them
> "age out" of the grace period individually.
> 
> Generally this works just fine, but there are corner cases where you
> might have a very large number of such objects, and the loose storage is
> much more expensive than the packed (e.g., because each object is stored
> individually, not as a delta).
> 
> It sounds like this is the case you're running into.
> 
> The solution is to lower the grace period time, with something like:
> 
>   git gc --prune=5.minutes.ago
> 
> or even:
> 
>   git gc --prune=now

You are correct, this solves the problem, however I'm curious. The usual
maintenance for these repositories is a regular run of:

- git fsck --full
- git repack -Adl -b --pack-kept-objects
- git pack-refs --all
- git prune

The reason it's split into repack + prune instead of just gc is because
we use alternates to save on disk space and try not to prune repos that
are used as alternates by other repos in order to avoid potential
corruption.

Am I not doing something that needs to be done in order to avoid the
same problem?

Thanks for your help.

Regards,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec



* Re: Repacking a repository uses up all available disk space
  2016-06-12 21:54   ` Konstantin Ryabitsev
@ 2016-06-12 22:13     ` Jeff King
  2016-06-13  0:24       ` Duy Nguyen
  2016-06-13  1:43       ` Nasser Grainawi
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2016-06-12 22:13 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: git

On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:

> >   git gc --prune=now
> 
> You are correct, this solves the problem, however I'm curious. The usual
> maintenance for these repositories is a regular run of:
> 
> - git fsck --full
> - git repack -Adl -b --pack-kept-objects
> - git pack-refs --all
> - git prune
> 
> The reason it's split into repack + prune instead of just gc is because
> we use alternates to save on disk space and try not to prune repos that
> are used as alternates by other repos in order to avoid potential
> corruption.
> 
> Am I not doing something that needs to be done in order to avoid the
> same problem?

Your approach makes sense; we do the same thing at GitHub for the same
reasons[1]. The main thing you are missing that gc will do is that it
knows the prune-time it is going to feed to git-prune[2], and passes
that along to repack. That's what enables the "don't bother ejecting
these, because I'm about to delete them" optimization.

That option is not documented, because it was always assumed to be an
internal thing to git-gc, but it is:

  git repack ... --unpack-unreachable=5.minutes.ago

or whatever.
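
Put together, a split repack + prune run that shares one expiry between the
two steps might look like the sketch below. It is an illustration, not the
script either of us actually runs: the function name `maintain_repo` and the
2-week default are assumptions, and the `-b --pack-kept-objects` flags from
the original command line are left out to keep it short.

```shell
#!/bin/sh
# Hypothetical maintenance helper: hand the same expiry to repack and prune,
# so repack does not loosen objects that prune would delete right away.
maintain_repo() {
	repo=$1                    # path to the repository
	expire=${2:-2.weeks.ago}   # grace period; the default here is an assumption

	git -C "$repo" repack -q -A -d -l --unpack-unreachable="$expire" &&
	git -C "$repo" pack-refs --all &&
	git -C "$repo" prune --expire="$expire"
}
```

For example, `maintain_repo /path/to/repo.git 5.minutes.ago` uses a short
grace period, while passing `now` drops every unreachable object at once.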

-Peff

[1] We don't run the fsck at the front, though, because it's really
    expensive.  I'm not sure it buys you much, either. The repack
    will do a full walk of the graph, so it gets you a connectivity
    check, as well as a full content check of the commits and trees. The
    blobs are copied as-is from the old pack, but there is a checksum on
    the pack data (to catch any bit flips by the disk storage). So the
    only thing the fsck is getting you is that it fully reconstructs the
    deltas for each blob and checks their sha1. That's more robust than
    a checksum, but it's a lot more expensive.

[2] It's unclear to me if you're passing any options to git-prune, but
    you may want to pass "--expire" with a short grace period. Without
    any options it prunes every unreachable thing, which can lead to
    races if the repository is actively being used.

    At GitHub we actually have a patch to `repack` that keeps all
    objects, reachable or not, in the pack, and use it for all of our
    automated maintenance. Since we don't drop objects at all, we can't
    ever have such a race. Aside from some pathological cases, it wastes
    much less space than you'd expect. We turn the flag off for special
    cases (e.g., somebody has rewound history and wants to expunge a
    sensitive object).

    I'm happy to share the "keep everything" patch if you're interested.


* Re: Repacking a repository uses up all available disk space
  2016-06-12 22:13     ` Jeff King
@ 2016-06-13  0:24       ` Duy Nguyen
  2016-06-13  4:58         ` Jeff King
  2016-06-13  1:43       ` Nasser Grainawi
  1 sibling, 1 reply; 11+ messages in thread
From: Duy Nguyen @ 2016-06-13  0:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Konstantin Ryabitsev, Git Mailing List

On Mon, Jun 13, 2016 at 5:13 AM, Jeff King <peff@peff.net> wrote:
> On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:
>
>> >   git gc --prune=now
>>
>> You are correct, this solves the problem, however I'm curious. The usual
>> maintenance for these repositories is a regular run of:
>>
>> - git fsck --full
>> - git repack -Adl -b --pack-kept-objects
>> - git pack-refs --all
>> - git prune
>>
>> The reason it's split into repack + prune instead of just gc is because
>> we use alternates to save on disk space and try not to prune repos that
>> are used as alternates by other repos in order to avoid potential
>> corruption.

Isn't this what extensions.preciousObjects is for? It looks like prune
just refuses to run in precious objects mode though, and repack is
skipped by gc, but if that repack command works, maybe we should do
something like that in git-gc?

BTW Jeff, I think we need more documentation for
extensions.preciousObjects. It's only documented in technical/ which
is practically invisible to all users. Maybe
include::repository-version.txt in config.txt, or somewhere close to
alternates?

> [2] It's unclear to me if you're passing any options to git-prune, but
>     you may want to pass "--expire" with a short grace period. Without
>     any options it prunes every unreachable thing, which can lead to
>     races if the repository is actively being used.
>
>     At GitHub we actually have a patch to `repack` that keeps all
>     objects, reachable or not, in the pack, and use it for all of our
>     automated maintenance. Since we don't drop objects at all, we can't
>     ever have such a race. Aside from some pathological cases, it wastes
>     much less space than you'd expect. We turn the flag off for special
>     cases (e.g., somebody has rewound history and wants to expunge a
>     sensitive object).
>
>     I'm happy to share the "keep everything" patch if you're interested.

Ah ok, I guess this is why we just skip repack. I guess '-Adl -b
--pack-kept-objects' is not enough then.
-- 
Duy


* Re: Repacking a repository uses up all available disk space
  2016-06-12 22:13     ` Jeff King
  2016-06-13  0:24       ` Duy Nguyen
@ 2016-06-13  1:43       ` Nasser Grainawi
  2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Nasser Grainawi @ 2016-06-13  1:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Konstantin Ryabitsev, git

On Jun 12, 2016, at 4:13 PM, Jeff King <peff@peff.net> wrote:
> 
>    At GitHub we actually have a patch to `repack` that keeps all
>    objects, reachable or not, in the pack, and use it for all of our
>    automated maintenance. Since we don't drop objects at all, we can't
>    ever have such a race. Aside from some pathological cases, it wastes
>    much less space than you'd expect. We turn the flag off for special
>    cases (e.g., somebody has rewound history and wants to expunge a
>    sensitive object).
> 
>    I'm happy to share the "keep everything" patch if you're interested.

We have the same kind of patch actually (for the same reason), but back on the shell implementation of repack. It'd be great if you could share your modern version.

Nasser

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project


* [PATCH 0/3] repack --keep-unreachable
  2016-06-13  1:43       ` Nasser Grainawi
@ 2016-06-13  4:33         ` Jeff King
  2016-06-13  4:33           ` [PATCH 1/3] repack: document --unpack-unreachable option Jeff King
                             ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jeff King @ 2016-06-13  4:33 UTC (permalink / raw)
  To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

On Sun, Jun 12, 2016 at 07:43:27PM -0600, Nasser Grainawi wrote:

> On Jun 12, 2016, at 4:13 PM, Jeff King <peff@peff.net> wrote:
> > 
> >    At GitHub we actually have a patch to `repack` that keeps all
> >    objects, reachable or not, in the pack, and use it for all of our
> >    automated maintenance. Since we don't drop objects at all, we can't
> >    ever have such a race. Aside from some pathological cases, it wastes
> >    much less space than you'd expect. We turn the flag off for special
> >    cases (e.g., somebody has rewound history and wants to expunge a
> >    sensitive object).
> > 
> >    I'm happy to share the "keep everything" patch if you're interested.
> 
> We have the same kind of patch actually (for the same reason), but
> back on the shell implementation of repack. It'd be great if you could
> share your modern version.

Here is a cleaned-up version of what we run at GitHub (so this is a
concept that has been exercised for a few years in production, but I had
to forward port the patches a bit; I _probably_ didn't introduce any
bugs. :) ).

The heavy lifting is done by the existing --keep-unreachable option to
pack-objects, which Junio added a long time ago[1] in support of a safer
"gc --auto". But it doesn't look like we ever documented or exercised
it, and "gc --auto" ended up using the loosen-unreachable strategy
instead. In fact, the rest of that series seems to have been dropped; I
couldn't find any discussion on the list explaining it, or why this one
patch was kept (so I don't think anybody upstream has ever used this
code, but as I said, we have been doing so for a few years, so I feel
confident in it).

  [1/3]: repack: document --unpack-unreachable option
  [2/3]: repack: add --keep-unreachable option
  [3/3]: repack: extend --keep-unreachable to loose objects

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/58413


* [PATCH 1/3] repack: document --unpack-unreachable option
  2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
@ 2016-06-13  4:33           ` Jeff King
  2016-06-13  4:36           ` [PATCH 2/3] repack: add --keep-unreachable option Jeff King
  2016-06-13  4:38           ` [PATCH 3/3] repack: extend --keep-unreachable to loose objects Jeff King
  2 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2016-06-13  4:33 UTC (permalink / raw)
  To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

This was added back in 7e52f56 (gc: do not explode objects
which will be immediately pruned, 2012-04-07), but not
documented at the time, since it was an internal detail
between git-gc and git-repack. However, as people with
complicated setups may want to effectively reimplement the
steps of git-gc themselves, it is nice for us to document
these interfaces.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index b9c02ce..cde7b44 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -128,6 +128,12 @@ other objects in that pack they already have locally.
 	with `-b` or `repack.writeBitmaps`, as it ensures that the
 	bitmapped packfile has the necessary objects.
 
+--unpack-unreachable=<when>::
+	When loosening unreachable objects, do not bother loosening any
+	objects older than `<when>`. This can be used to optimize out
+	the write of any objects that would be immediately pruned by
+	a follow-up `git prune`.
+
 Configuration
 -------------
 
-- 
2.9.0.rc2.149.gd580ccd


* [PATCH 2/3] repack: add --keep-unreachable option
  2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
  2016-06-13  4:33           ` [PATCH 1/3] repack: document --unpack-unreachable option Jeff King
@ 2016-06-13  4:36           ` Jeff King
  2016-06-13  4:38           ` [PATCH 3/3] repack: extend --keep-unreachable to loose objects Jeff King
  2 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2016-06-13  4:36 UTC (permalink / raw)
  To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.

This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.

So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.

It would be nice to fix that race in the long run, but it's
complicated.  In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.

Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt         |  6 ++++++
 builtin/repack.c                     |  9 +++++++++
 t/t7701-repack-unpack-unreachable.sh | 15 +++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index cde7b44..68702ea 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -134,6 +134,12 @@ other objects in that pack they already have locally.
 	the write of any objects that would be immediately pruned by
 	a follow-up `git prune`.
 
+-k::
+--keep-unreachable::
+	When used with `-ad`, any unreachable objects from existing
+	packs will be appended to the end of the packfile instead of
+	being removed.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 858db38..573e66c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -146,6 +146,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
+	int keep_unreachable = 0;
 	const char *window = NULL, *window_memory = NULL;
 	const char *depth = NULL;
 	const char *max_pack_size = NULL;
@@ -175,6 +176,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("write bitmap index")),
 		OPT_STRING(0, "unpack-unreachable", &unpack_unreachable, N_("approxidate"),
 				N_("with -A, do not loosen objects older than this")),
+		OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
+				N_("with -a, repack unreachable objects")),
 		OPT_STRING(0, "window", &window, N_("n"),
 				N_("size of the window used for delta compression")),
 		OPT_STRING(0, "window-memory", &window_memory, N_("bytes"),
@@ -196,6 +199,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant && repository_format_precious_objects)
 		die(_("cannot delete packs in a precious-objects repo"));
 
+	if (keep_unreachable &&
+	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
+		die(_("--keep-unreachable and -A are incompatible"));
+
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps;
 
@@ -239,6 +246,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			} else if (pack_everything & LOOSEN_UNREACHABLE) {
 				argv_array_push(&cmd.args,
 						"--unpack-unreachable");
+			} else if (keep_unreachable) {
+				argv_array_push(&cmd.args, "--keep-unreachable");
 			} else {
 				argv_array_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b66e383..f13df43 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -122,4 +122,19 @@ test_expect_success 'keep packed objects found only in index' '
 	git cat-file blob :file
 '
 
+test_expect_success 'repack -k keeps unreachable packed objects' '
+	# create packed-but-unreachable object
+	sha1=$(echo unreachable-packed | git hash-object -w --stdin) &&
+	pack=$(echo $sha1 | git pack-objects .git/objects/pack/pack) &&
+	git prune-packed &&
+
+	# -k should keep it
+	git repack -adk &&
+	git cat-file -p $sha1 &&
+
+	# and double check that without -k it would have been removed
+	git repack -ad &&
+	test_must_fail git cat-file -p $sha1
+'
+
 test_done
-- 
2.9.0.rc2.149.gd580ccd


* [PATCH 3/3] repack: extend --keep-unreachable to loose objects
  2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
  2016-06-13  4:33           ` [PATCH 1/3] repack: document --unpack-unreachable option Jeff King
  2016-06-13  4:36           ` [PATCH 2/3] repack: add --keep-unreachable option Jeff King
@ 2016-06-13  4:38           ` Jeff King
  2 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2016-06-13  4:38 UTC (permalink / raw)
  To: Nasser Grainawi; +Cc: Konstantin Ryabitsev, git, Junio C Hamano

If you use "repack -adk" currently, we will pack all objects
that are already packed into the new pack, and then drop the
old packs. However, loose unreachable objects will be left
as-is. In theory these are meant to expire eventually with
"git prune". But if you are using "repack -k", you probably
want to keep things forever and therefore do not run "git
prune" at all. Meaning those loose objects may build up over
time and end up fooling any object-count heuristics (such as
the one done by "gc --auto", though since git-gc does not
support "repack -k", this really applies to whatever custom
scripts people might have driving "repack -k").

With this patch, we instead stuff any loose unreachable
objects into the pack along with the already-packed
unreachable objects. This may seem wasteful, but it is
really no more so than using "repack -k" in the first place.
We are at a slight disadvantage, in that we have no useful
ordering for the result, or names to hand to the delta code.
However, this is again no worse than what "repack -k" is
already doing for the packed objects. The packing of these
objects doesn't matter much because they should not be
accessed frequently (unless they actually _do_ become
referenced, but then they would get moved to a different
part of the packfile during the next repack).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-repack.txt         |  3 ++-
 builtin/pack-objects.c               | 31 +++++++++++++++++++++++++++++++
 builtin/repack.c                     |  1 +
 t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 68702ea..b58b6b5 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -138,7 +138,8 @@ other objects in that pack they already have locally.
 --keep-unreachable::
 	When used with `-ad`, any unreachable objects from existing
 	packs will be appended to the end of the packfile instead of
-	being removed.
+	being removed. In addition, any unreachable loose objects will
+	be packed (and their loose counterparts removed).
 
 Configuration
 -------------
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8f5e358..a2f8cfd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@ static int non_empty;
 static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static unsigned long unpack_unreachable_expiration;
+static int pack_loose_unreachable;
 static int local;
 static int incremental;
 static int ignore_packed_keep;
@@ -2378,6 +2379,32 @@ static void add_objects_in_unpacked_packs(struct rev_info *revs)
 	free(in_pack.array);
 }
 
+static int add_loose_object(const unsigned char *sha1, const char *path,
+			    void *data)
+{
+	enum object_type type = sha1_object_info(sha1, NULL);
+
+	if (type < 0) {
+		warning("loose object at %s could not be examined", path);
+		return 0;
+	}
+
+	add_object_entry(sha1, type, "", 0);
+	return 0;
+}
+
+/*
+ * We actually don't even have to worry about reachability here.
+ * add_object_entry will weed out duplicates, so we just add every
+ * loose object we find.
+ */
+static void add_unreachable_loose_objects(void)
+{
+	for_each_loose_file_in_objdir(get_object_directory(),
+				      add_loose_object,
+				      NULL, NULL, NULL);
+}
+
 static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
 {
 	static struct packed_git *last_found = (void *)1;
@@ -2547,6 +2574,8 @@ static void get_object_list(int ac, const char **av)
 
 	if (keep_unreachable)
 		add_objects_in_unpacked_packs(&revs);
+	if (pack_loose_unreachable)
+		add_unreachable_loose_objects();
 	if (unpack_unreachable)
 		loosen_unused_packed_objects(&revs);
 
@@ -2647,6 +2676,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("include tag objects that refer to objects to be packed")),
 		OPT_BOOL(0, "keep-unreachable", &keep_unreachable,
 			 N_("keep unreachable objects")),
+		OPT_BOOL(0, "pack-loose-unreachable", &pack_loose_unreachable,
+			 N_("pack loose unreachable objects")),
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
diff --git a/builtin/repack.c b/builtin/repack.c
index 573e66c..f7b7409 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -248,6 +248,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 						"--unpack-unreachable");
 			} else if (keep_unreachable) {
 				argv_array_push(&cmd.args, "--keep-unreachable");
+				argv_array_push(&cmd.args, "--pack-loose-unreachable");
 			} else {
 				argv_array_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index f13df43..987573c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -137,4 +137,17 @@ test_expect_success 'repack -k keeps unreachable packed objects' '
 	test_must_fail git cat-file -p $sha1
 '
 
+test_expect_success 'repack -k packs unreachable loose objects' '
+	# create loose unreachable object
+	sha1=$(echo would-be-deleted-loose | git hash-object -w --stdin) &&
+	objpath=.git/objects/$(echo $sha1 | sed "s,..,&/,") &&
+	test_path_is_file $objpath &&
+
+	# and confirm that the loose object goes away, but we can
+	# still access it (ergo, it is packed)
+	git repack -adk &&
+	test_path_is_missing $objpath &&
+	git cat-file -p $sha1
+'
+
 test_done
-- 
2.9.0.rc2.149.gd580ccd


* Re: Repacking a repository uses up all available disk space
  2016-06-13  0:24       ` Duy Nguyen
@ 2016-06-13  4:58         ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2016-06-13  4:58 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Konstantin Ryabitsev, Git Mailing List

On Mon, Jun 13, 2016 at 07:24:51AM +0700, Duy Nguyen wrote:

> >> - git fsck --full
> >> - git repack -Adl -b --pack-kept-objects
> >> - git pack-refs --all
> >> - git prune
> >>
> >> The reason it's split into repack + prune instead of just gc is because
> >> we use alternates to save on disk space and try not to prune repos that
> >> are used as alternates by other repos in order to avoid potential
> >> corruption.
> 
> Isn't this what extensions.preciousObjects is for? It looks like prune
> just refuses to run in precious objects mode though, and repack is
> skipped by gc, but if that repack command works, maybe we should do
> something like that in git-gc?

Sort of. preciousObjects is a fail-safe so that you do not ever
accidentally run an object-deleting operation where you shouldn't (e.g.,
in the shared repository used by others as an alternate). So the
important step there is that before running "repack", you would want to
make sure you have taken into account the reachability of anybody
sharing from you.

So you could do something like (in your shared repository):

  git config core.repositoryFormatVersion 1
  git config extensions.preciousObjects true

  # this will fail, because it's dangerous!
  git gc

  # but we can do it safely if we take into account the other repos
  for repo in $(somehow_get_list_of_shared_repos); do
    git fetch "$repo" "+refs/*:refs/shared/$repo/*"
  done
  git config extensions.preciousObjects false
  git gc
  git config extensions.preciousObjects true

So it really is orthogonal to running the various gc commands yourself;
it's just here to prevent you shooting yourself in the foot.

It may still be useful in such a case to split up the commands in your
own script, though. In my case, you'll note that the commands above are
racy (what happens if somebody pushes a reference to a shared object
between your fetch and the gc invocation?). So we use a custom "repack
-k" to get around that (it just keeps everything).

You _could_ have gc automatically switch to "-k" in a preciousObjects
repository. That's at least safe. But note that it doesn't really solve
all of the problems (you do still want to have ref tips from the leaf
repositories, because it affects things like bitmaps, and packing
order).

> BTW Jeff, I think we need more documentation for
> extensions.preciousObjects. It's only documented in technical/ which
> is practically invisible to all users. Maybe
> include::repository-version.txt in config.txt, or somewhere close to
> alternates?

I'm a little hesitant to document it for end users because it's still
pretty experimental. In fact, even we are not using it at GitHub
currently. We don't have a big problem with "oops, I accidentally ran
something destructive in the shared repository", because nothing except
the maintenance script ever even goes into the shared repository.

The reason I introduced it in the first place is that I was
experimenting with the idea of actually symlinking "objects/" in the
leaf repos into the shared repository. That eliminates the object
writing in the "fetch" step above, which can be a bottleneck in some
cases (not just the I/O, but the shared repo ends up having a _lot_ of
refs, and fetch can be pretty slow).

But in that case, anything that deletes an object in one of the leaf
repos is very dangerous, as it has no idea that its object store is
shared with other leaf repos. So I really wanted a fail safe so that
running "git gc" wasn't catastrophic.

I still think that's a viable approach, but my experiments got
side-tracked and I never produced anything worth looking at. So until
there's something end users can actually make use of, I'm hesitant to
push that stuff into the regular-user documentation. Anybody who is
playing with it at this point probably _should_ be familiar with what's
in Documentation/technical.

-Peff


Thread overview (11 messages):
2016-06-12 21:25 Repacking a repository uses up all available disk space Konstantin Ryabitsev
2016-06-12 21:38 ` Jeff King
2016-06-12 21:54   ` Konstantin Ryabitsev
2016-06-12 22:13     ` Jeff King
2016-06-13  0:24       ` Duy Nguyen
2016-06-13  4:58         ` Jeff King
2016-06-13  1:43       ` Nasser Grainawi
2016-06-13  4:33         ` [PATCH 0/3] repack --keep-unreachable Jeff King
2016-06-13  4:33           ` [PATCH 1/3] repack: document --unpack-unreachable option Jeff King
2016-06-13  4:36           ` [PATCH 2/3] repack: add --keep-unreachable option Jeff King
2016-06-13  4:38           ` [PATCH 3/3] repack: extend --keep-unreachable to loose objects Jeff King
