git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git <git@vger.kernel.org>,
	Git for human beings <git-users@googlegroups.com>
Cc: Christian Couder <christian.couder@gmail.com>
Subject: How de-duplicate similar repositories with alternates
Date: Thu, 29 Nov 2018 15:59:26 +0100
Message-ID: <87zhtsx73l.fsf@evledraar.gmail.com>

A co-worker asked me today how space could be saved when you have
multiple checkouts of the same repository (at different revs) on the
same machine. I said that since these won't de-duplicate well at the
block level[1], one way to do this is with alternates.

However, I didn't know how to get the gains for an existing clone
without a full re-clone, although I hadn't looked deeply into it. As it
turns out I was wrong about that, which I found out when writing the
following test case, which shows that it works:

    (
        cd /tmp &&
        rm -rf /tmp/git-{master,pu,pu-alt}.git &&

        # Normal clones
        git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
        git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&

        # An 'alternate' clone using 'master' objects from another repo
        git --bare init /tmp/git-pu-alt.git &&
        for git in git-pu.git git-pu-alt.git
        do
            echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
        done &&
        git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu &&

        # Respective sizes, 'alternate' clone much smaller
        du -shc /tmp/git-*.git &&

        # GC them all. Compacts the git-pu.git to git-pu-alt.git's size
        for repo in git-*.git
        do
            git -C $repo gc
        done &&
        du -shc /tmp/git-*.git &&

        # Add another big history (GFW) to git-{pu,master}.git (in that order!)
        for repo in $(ls -d /tmp/git-*.git | sort -r)
        do
            git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
        done &&
        du -shc /tmp/git-*.git &&

        # Another GC. The objects now in git-master.git will be de-duped by all
        for repo in git-*.git
        do
            git -C $repo gc
        done &&
        du -shc /tmp/git-*.git
    )

This shows a scenario where we clone git.git at "master" and "pu" in
different places. After cloning, the relevant sizes are:

    108M    /tmp/git-master.git
    3.2M    /tmp/git-pu-alt.git
    109M    /tmp/git-pu.git
    219M    total

I.e. git-pu-alt.git is much smaller since it points via alternates to
git-master.git, and the history of "pu" shares most of its objects with
"master". But then how do you get those gains for git-pu.git? Turns out
you just run "git gc":

    111M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.1M    /tmp/git-pu.git
    115M    total

This is the thing I was wrong about. In retrospect that's probably
because I'd been putting PATH_TO_REPO in objects/info/alternates, when
we actually need PATH_TO_REPO/objects, and neither "git gc" nor "git
fsck" will warn about this. It's probably a good idea to patch that at
some point, i.e. whine about paths in alternates that don't have
objects, or at the very least those that don't exist. #leftoverbits
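
Until git itself warns about this, the check can be approximated
outside of git. The following is only a rough sketch, not anything git
ships: check_alternates is a made-up name, and "has an objects/
subdirectory" is just my heuristic for "this is a repository, not an
object directory". It assumes one path per line in the alternates file,
with relative paths resolved against the objects directory.

```shell
# Hypothetical helper (not part of git): sanity-check the entries in a
# repository's objects/info/alternates file.
# $1 is the git directory (a bare repo, or the .git dir of a checkout).
check_alternates () {
    ca_gitdir=$1
    ca_file=$ca_gitdir/objects/info/alternates
    # No alternates file means there is nothing to check
    test -f "$ca_file" || return 0
    ca_status=0
    while IFS= read -r ca_alt || test -n "$ca_alt"
    do
        # Skip blank lines and '#' comment lines
        case "$ca_alt" in
        ''|'#'*) continue ;;
        esac
        # Assumption: relative entries are relative to the objects directory
        case "$ca_alt" in
        /*) ca_path=$ca_alt ;;
        *) ca_path=$ca_gitdir/objects/$ca_alt ;;
        esac
        if ! test -d "$ca_path"
        then
            echo "warning: alternate '$ca_alt' does not exist" >&2
            ca_status=1
        elif test -d "$ca_path/objects"
        then
            # Heuristic: an object directory has no objects/ subdirectory
            echo "warning: alternate '$ca_alt' looks like a repository, did you mean '$ca_alt/objects'?" >&2
            ca_status=1
        fi
    done <"$ca_file"
    return $ca_status
}
```

E.g. "check_alternates /tmp/git-pu.git" returns non-zero and prints a
warning if the alternates file there names /tmp/git-master.git instead
of /tmp/git-master.git/objects.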

Then when we fetch git-for-windows:master to all the repos they all grow
by the amount git-for-windows has diverged:

    144M    /tmp/git-master.git
    36M     /tmp/git-pu-alt.git
    36M     /tmp/git-pu.git
    214M    total

Note that the "sort -r" is critical here. If we fetched git-master.git
first (at this point the alternate for git-pu*.git) we wouldn't get the
duplication in the first place, but instead:

    144M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.1M    /tmp/git-pu.git
    148M    total

This shows the importance of keeping such an 'alternate' repo
up-to-date, since then we don't get the duplication in the first place.
Regardless, a "git gc" will coalesce them (this is from a run with
"sort -r"):

    131M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.2M    /tmp/git-pu.git
    135M    total
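
The ordering point above (fetch into the alternate before the
repositories that borrow from it) could be wrapped up like this. Again
just a sketch: fetch_alternate_first is a hypothetical name, and it
assumes the dependent repositories already list the alternate's objects
directory in their objects/info/alternates.

```shell
# Hypothetical helper: fetch into the alternate first, then into the
# repositories that borrow objects from it.
# usage: fetch_alternate_first <url> <refspec> <alternate-repo> <dependent-repo>...
fetch_alternate_first () {
    faf_url=$1
    faf_refspec=$2
    faf_alt=$3
    shift 3
    # The alternate gets the new objects first ...
    git -C "$faf_alt" fetch --no-tags "$faf_url" "$faf_refspec" || return 1
    # ... so that the dependents can find them via their alternates
    # file instead of storing their own copies.
    for faf_repo
    do
        git -C "$faf_repo" fetch --no-tags "$faf_url" "$faf_refspec" || return 1
    done
}
```

For the test case above that would be:

    fetch_alternate_first https://github.com/git-for-windows/git \
        master:master-gfw \
        /tmp/git-master.git /tmp/git-pu.git /tmp/git-pu-alt.git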

If you find this interesting make sure to read my
https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
caveats, i.e. if this is something intended for users then no ref in the
alternate can ever be rewound, since that can potentially result in
repository corruption.

1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/
