From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git <git@vger.kernel.org>,
	Git for human beings <git-users@googlegroups.com>
Cc: Christian Couder <christian.couder@gmail.com>,
	Duy Nguyen <pclouds@gmail.com>
Subject: Re: How de-duplicate similar repositories with alternates
Date: Thu, 29 Nov 2018 17:09:37 +0100
Message-ID: <87sgzjyif2.fsf@evledraar.gmail.com>
In-Reply-To: <87zhtsx73l.fsf@evledraar.gmail.com>


On Thu, Nov 29 2018, Ævar Arnfjörð Bjarmason wrote:

> A co-worker asked me today how space could be saved when you have
> multiple checkouts of the same repository (at different revs) on the
> same machine. I said that since these won't block-level de-duplicate
> well[1], one way to do this is with alternates.
>
> However, once you have an existing clone I didn't know how to get the
> gains without a full re-clone, but I hadn't looked deeply into it. As it
> turns out I'm wrong about that, which I found when writing the following
> test-case which shows that it works:
>
>     (
>         cd /tmp &&
>         rm -rf /tmp/git-{master,pu,pu-alt}.git &&
>
>         # Normal clones
>         git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
>         git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&
>
>         # An 'alternate' clone using 'master' objects from another repo
>         git --bare init /tmp/git-pu-alt.git &&
>         for git in git-pu.git git-pu-alt.git
>         do
>             echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
>         done &&
>         git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu &&
>
>         # Respective sizes, 'alternate' clone much smaller
>         du -shc /tmp/git-*.git &&
>
>         # GC them all. Compacts the git-pu.git to git-pu-alt.git's size
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git &&
>
>         # Add another big history (GFW) to git-{pu,master}.git (in that order!)
>         for repo in $(ls -d /tmp/git-*.git | sort -r)
>         do
>             git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
>         done &&
>         du -shc /tmp/git-*.git &&
>
>         # Another GC. The objects now in git-master.git will be de-duped by all
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git
>     )
>
> This shows a scenario where we clone git.git at "master" and "pu" in
> different places. After clone the relevant sizes are:
>
>     108M    /tmp/git-master.git
>     3.2M    /tmp/git-pu-alt.git
>     109M    /tmp/git-pu.git
>     219M    total
>
> I.e. git-pu-alt.git is much smaller since it points via alternates to
> git-master.git, and the history of "pu" shares most of its objects with
> "master". But then how do you get those gains for git-pu.git? Turns out
> you just run "git gc":
>
>     111M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     115M    total
>
> This is the thing I was wrong about, in retrospect probably because I'd
> been putting PATH_TO_REPO in objects/info/alternates, but we actually
> need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
> fsck"). Probably a good idea to patch that at some point, i.e. whine
> about paths in alternates that don't have objects, or at the very least
> those that don't exist. #leftoverbits
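
Until something like that exists in git itself, the check can be
approximated from the outside. What follows is only a rough sketch under
stated assumptions: check_alternates is a made-up helper name (not a git
command), it expects a bare repository layout (objects/ directly under
the given path), and it makes no attempt to handle quoted paths. It
flags entries that don't exist and entries that look like a repository
(a directory with both HEAD and objects/) rather than an object
directory.

```shell
# check_alternates is a hypothetical helper, not part of git.
# Usage: check_alternates /path/to/repo.git   (bare layout assumed)
check_alternates () {
    objdir=$1/objects
    file=$objdir/info/alternates
    test -f "$file" || { echo "no alternates file in $1"; return 0; }
    status=0
    while IFS= read -r alt || test -n "$alt"
    do
        # Skip blank lines and comment lines
        case $alt in
        ''|'#'*) continue ;;
        esac
        # Relative entries are resolved against the objects directory
        case $alt in
        /*) path=$alt ;;
        *) path=$objdir/$alt ;;
        esac
        if ! test -d "$path"
        then
            echo "missing: $alt"
            status=1
        elif test -d "$path/objects" && test -f "$path/HEAD"
        then
            echo "looks like a repository, not an object directory: $alt"
            status=1
        else
            echo "ok: $alt"
        fi
    done <"$file"
    return $status
}
```

Running it as "check_alternates /tmp/git-pu.git" would have caught the
PATH_TO_REPO mistake described above, since the entry would be reported
as looking like a repository.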

Actually, looking at this again, the thing that may have stumped me last
time is that this interacts badly with gc.bigPackThreshold. If you have
an alternate that would otherwise house most of your objects *and* you
have a pack that's larger than gc.bigPackThreshold, your mostly
redundant pack won't be removed.

That's understandable in terms of implementation, but unfortunate. It
would be nice if we learned some way to detect this, i.e. "I have this
10GB pack, but with this alternate I can extract the 100MB I still need
out of it and throw the rest away". Currently we just keep the 10GB pack
even if it's mostly redundant to what's in the alternate.
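
As a manual workaround, the threshold can be overridden for a single
run. This is a minimal sketch, not something I'd call a recommendation:
gc_ignoring_big_packs is a made-up helper name, and it assumes you're
willing to pay for a full repack of the big pack, which is exactly the
cost gc.bigPackThreshold exists to avoid.

```shell
# gc_ignoring_big_packs is a hypothetical helper, not part of git.
# Usage: gc_ignoring_big_packs /path/to/repo.git
gc_ignoring_big_packs () {
    repo=$1
    # Report the configured threshold, if any
    git -C "$repo" config --get gc.bigPackThreshold ||
        echo "gc.bigPackThreshold is unset"
    # One gc with the threshold disabled for this invocation only;
    # the repository's own configuration is left untouched
    git -C "$repo" -c gc.bigPackThreshold=0 gc --quiet
}
```

With the threshold at 0 no pack is kept on size alone, so the repack
that "git gc" runs should be able to drop the objects already available
through the alternate, as in the git-pu.git case quoted above.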

> Then when we fetch git-for-windows:master to all the repos they all grow
> by the amount git-for-windows has diverged:
>
>     144M    /tmp/git-master.git
>     36M     /tmp/git-pu-alt.git
>     36M     /tmp/git-pu.git
>     214M    total
>
> Note that the "sort -r" is critical here. If we fetched git-master.git
> first (at this point the alternate for git-pu*.git) we wouldn't get the
> duplication in the first place, but instead:
>
>     144M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     148M    total
>
> This shows the importance of keeping such an 'alternate' repo
> up-to-date, i.e. we don't get the duplication in the first place, but
> regardless (this from a run with sort -r) a "git gc" will coalesce them:
>
>     131M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.2M    /tmp/git-pu.git
>     135M    total
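
The ordering point can be captured in a small wrapper. This is just a
sketch: fetch_alternate_first is a made-up helper name, and the
arguments (URL, refspec, the alternate, then the repositories borrowing
from it) are whatever applies to your setup. It fetches into the
alternate first and stops if that fails, so the borrowing repositories
never fetch while the alternate is behind.

```shell
# fetch_alternate_first is a hypothetical helper, not part of git.
# Usage: fetch_alternate_first <url> <refspec> <alternate> <borrower>...
fetch_alternate_first () {
    url=$1 refspec=$2 alternate=$3
    shift 3
    # Update the alternate first so the new objects land there
    git -C "$alternate" fetch --no-tags "$url" "$refspec" &&
    for repo in "$@"
    do
        # Borrowers fetch afterwards and can reuse the alternate's objects
        git -C "$repo" fetch --no-tags "$url" "$refspec" || return 1
    done
}
```

For the test-case above this would be called as "fetch_alternate_first
https://github.com/git-for-windows/git master:master-gfw
/tmp/git-master.git /tmp/git-pu.git /tmp/git-pu-alt.git".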
>
> If you find this interesting make sure to read my
> https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
> caveats, i.e. if this is something intended for users then no ref in the
> alternate can ever be rewound, that'll potentially result in repository
> corruption.
>
> 1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/
