git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: git@vger.kernel.org
Cc: jonathantanmy@google.com, peff@peff.net, jrnieder@gmail.com,
	Elijah Newren <newren@gmail.com>
Subject: Re: [PATCH] gc: do not warn about too many loose objects
Date: Mon, 16 Jul 2018 12:15:05 -0700	[thread overview]
Message-ID: <20180716191505.857-1-newren@gmail.com> (raw)
In-Reply-To: <20180716172717.237373-1-jonathantanmy@google.com>


On Mon, Jul 16, 2018 at 10:27 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
> cruft", 2007-09-17), the user was warned if there were too many
> unreachable loose objects. This made sense at the time, because gc
> couldn't prune them safely. But subsequently, git prune learned the
> ability to not prune recently created loose objects, making pruning able
> to be done more safely, and gc was made to automatically prune old
> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
> 2.weeks.ago" by default", 2008-03-12).
...
>
> ---
> This was noticed when a daemonized gc run wrote this warning to the log
> file, and returned 0; but a subsequent run merely read the log file, saw
> that it is non-empty and returned -1 (which is inconsistent in that such
> a run should return 0, as it did the first time).

Yeah, we've hit this several times too.  I even created a testcase and a
workaround, though I never got it into proper submission form.

The basic problem here, at least for us, is that gc has enough
information to know it could expunge some objects, but because of how
it is structured in terms of several substeps (reflog expiration,
repack, prune), the information is lost between the steps and it
instead writes them out as unreachable objects.  If we could prune (or
avoid exploding) loose objects that are only reachable from reflog
entries that we are expiring, then the problem goes away for us.  (I
totally understand that other repos may have enough unreachable
objects for other reasons that Peff's suggestion to just pack up
unreachable objects is still a really good idea.  But on its own, it
seems like a waste since it's packing stuff that we know we could just
expunge.)

Anyway, my very rough testcase is below.  The workaround is the
git_actual_garbage_collect() function (minus the call to
wait_for_background_gc_to_finish).

Elijah

---


wait_for_background_gc_to_finish() {
    while ( ps -ef | grep -v grep | grep --quiet git.gc.--auto ); do
        sleep 1;
    done
}

git_standard_garbage_collect() {
    # Current git gc sprays unreachable objects back in loose form; this is
    # fine in many cases, but is annoying when done with objects which
    # newly become unreachable because of something else git-gc does and
    # git-gc doesn't clean them up.
    git gc --auto
    wait_for_background_gc_to_finish
}

git_actual_garbage_collect() {
    GITDIR=$(git rev-parse --git-dir)

    # Record all revisions stored in reflog before and after gc
    git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
    git gc --auto
    wait_for_background_gc_to_finish
    git rev-list --no-walk --reflog >$GITDIR/gc.final-refs

    # Find out which reflog entries were removed
    DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))

    # Get the list of objects which used to be reachable, but were made
    # unreachable due to gc's reflog expiration.  To get these, I need
    # the intersection of things reachable from $DELETED_REFS and things
    # which are unreachable now.
    git rev-list --objects $DELETED_REFS --not --all --reflog | awk '{print $1}' >$GITDIR/gc.previously-reachable-objects
    git prune --expire=now --dry-run | awk '{print $1}' > $GITDIR/gc.unreachable-objects

    # Delete all the previously-reachable-objects made unreachable by the
    # reflog expiration done by git gc
    comm -12 <(sort $GITDIR/gc.unreachable-objects) <(sort $GITDIR/gc.previously-reachable-objects) | sed -e "s#^\(..\)#$GITDIR/objects/\1/#" | xargs rm
}


test -d testcase && { echo "testcase exists; exiting"; exit 1; }
git init testcase/
cd testcase

# Create a basic commit
echo hello >world
git add world
git commit -q -m "Initial"

# Create a commit with lots of files
for i in {0000..9999}; do echo $i >$i; done
git add [0-9]*
git commit --quiet -m "Lots of numbers"

# Pack it all up
git gc --quiet

# Stop referencing the garbage
git reset --quiet --hard HEAD~1

# Pretend we did all the above stuff 30 days ago
for rlog in $(find .git/logs -type f); do
  # Subtract 3E6 (just over 30 days) from every date (assuming dates have
  # exactly 10 digits, which just happens to be valid...right now at least)
  perl -i -ne '/(.*?)(\b[0-9]{10}\b)(.*)/ && print $1,$2-3000000,$3,"\n"' $rlog
done

# HOWEVER: note that the pack is new; if we make the pack old, the old objects
# will get pruned for us.  But it is quite common to have new packfiles with
# old-and-soon-to-be-unreferenced-objects because frequent gc's mean moving
# the objects to new packfiles often, and eventually the reflog is expired.
# If you want to test them being part of an old packfile, uncomment this:
#   find .git/objects/pack -type f | xargs touch -t 200001010101

# Create 50 packfiles in the current repo so that 'git gc --auto' will
# trigger `git repack -A -d -l` instead of just `git repack -d -l`
for i in {01..50}; do
  git fast-export master | sed -e s/Initial/Initi$i/ | git -c fastimport.unpacklimit=0 fast-import --quiet |& grep -v "Not updating refs/heads/master"
done

echo "*** Before gc, reflog refers to garbage-collectible commits: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
echo "*** Before gc, everything is packed with no loose objects: ***"
git count-objects -v

git_standard_garbage_collect  # Just `git gc --auto`
#git_actual_garbage_collect    # What I really want from `git gc --auto`

echo -e "\n\n*** After gc, commit garbage collected and objects made loose: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
git count-objects -v

echo -e "\n\n*** Now we can trigger the 'too many unreachable' error: ***"
git gc --auto

  parent reply	other threads:[~2018-07-16 19:15 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
2018-07-16 17:51 ` Jeff King
2018-07-16 18:22   ` Jonathan Nieder
2018-07-16 18:52     ` Jeff King
2018-07-16 19:09       ` Jonathan Nieder
2018-07-16 19:41         ` Jeff King
2018-07-16 19:54           ` Jonathan Nieder
2018-07-16 20:29             ` Jeff King
2018-07-16 20:37               ` Jonathan Nieder
2018-07-16 21:09                 ` Jeff King
2018-07-16 21:40                   ` Jonathan Nieder
2018-07-16 21:45                     ` Jeff King
2018-07-16 22:03                       ` Jonathan Nieder
2018-07-16 22:43                         ` Jeff King
2018-07-16 22:56                           ` Jonathan Nieder
2018-07-16 23:26                             ` Jeff King
2018-07-17  1:53                               ` Jonathan Nieder
2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
2018-07-17 14:03                                   ` Jonathan Nieder
2018-07-17 15:24                                     ` Ævar Arnfjörð Bjarmason
2018-07-17 20:27                                   ` Jeff King
2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
2018-07-18 17:29                                       ` Jeff King
2018-07-17 15:59                                 ` Duy Nguyen
2018-07-17 18:09                                 ` Junio C Hamano
2018-07-16 19:15 ` Elijah Newren [this message]
2018-07-16 19:19   ` Jonathan Nieder
2018-07-16 20:21     ` Elijah Newren
2018-07-16 20:35       ` Jeff King
2018-07-16 20:56         ` Jonathan Nieder
2018-07-16 21:12           ` Jeff King
2018-07-16 19:52   ` Jeff King
2018-07-16 20:16     ` Elijah Newren
2018-07-16 20:38       ` Jeff King
2018-07-16 21:09         ` Elijah Newren
2018-07-16 21:21           ` Jeff King
2018-07-16 22:07             ` Elijah Newren
2018-07-16 22:55               ` Jeff King
2018-07-16 23:06                 ` Elijah Newren
2018-07-16 21:31           ` Jonathan Nieder
2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
2018-07-17 18:19     ` Junio C Hamano
2018-07-17 19:58     ` Jeff King
2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
2018-07-17 18:22     ` Junio C Hamano
2018-07-17 19:59     ` Jeff King
2018-09-17 18:33       ` Jeff King
2018-09-17 18:40         ` Jonathan Nieder
2018-09-18 17:30           ` Jeff King
2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17 20:13     ` Jeff King
2018-07-18 16:21       ` Junio C Hamano
2018-07-18 17:22         ` Jeff King
2018-07-18 18:19           ` Junio C Hamano
2018-07-18 19:06             ` Jeff King
2018-07-18 19:55               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180716191505.857-1-newren@gmail.com \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).