git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* How de-duplicate similar repositories with alternates
@ 2018-11-29 14:59 Ævar Arnfjörð Bjarmason
  2018-11-29 16:09 ` Ævar Arnfjörð Bjarmason
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-29 14:59 UTC (permalink / raw)
  To: git, Git for human beings; +Cc: Christian Couder

A co-worker asked me today how space could be saved when you have
multiple checkouts of the same repository (at different revs) on the
same machine. I said since these won't block-level de-duplicate well[1]
one way to do this is with alternates.

However, once you have an existing clone I didn't know how to get the
gains without a full re-clone, but I hadn't looked deeply into it. As it
turns out I'm wrong about that, which I found when writing the following
test-case which shows that it works:

    (
        cd /tmp &&
        rm -rf /tmp/git-{master,pu,pu-alt}.git &&

        # Normal clones
        git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
        git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&

        # An 'alternate' clone using 'master' objects from another repo
        git --bare init /tmp/git-pu-alt.git &&
        for git in git-pu.git git-pu-alt.git
        do
            echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
        done &&
        git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu

        # Respective sizes, 'alternate' clone much smaller
        du -shc /tmp/git-*.git &&

        # GC them all. Compacts the git-pu.git to git-pu-alt.git's size
        for repo in git-*.git
        do
            git -C $repo gc
        done &&
        du -shc /tmp/git-*.git

        # Add another big history (GFW) to git-{pu,master}.git (in that order!)
        for repo in $(ls -d /tmp/git-*.git | sort -r)
        do
            git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
        done &&
        du -shc /tmp/git-*.git &&

        # Another GC. The objects now in git-master.git will be de-duped by all
        for repo in git-*.git
        do
            git -C $repo gc
        done &&
        du -shc /tmp/git-*.git
    )

This shows a scenario where we clone git.git at "master" and "pu" in
different places. After clone the relevant sizes are:

    108M    /tmp/git-master.git
    3.2M    /tmp/git-pu-alt.git
    109M    /tmp/git-pu.git
    219M    total

I.e. git-pu-alt.git is much smaller since it points via alternates to
git-master.git, and the history of "pu" shares most of the objects with
"master". But then how do you get those gains for git-pu.git? Turns out
you just "git gc"

    111M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.1M    /tmp/git-pu.git
    115M    total

This is the thing I was wrong about, in retrospect probably because I'd
been putting PATH_TO_REPO in objects/info/alternates, but we actually
need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
fsck"). Probably a good idea to patch that at some point, i.e. whine
about paths in alternates that don't have objects, or at the very least
those that don't exist. #leftoverbits

Then when we fetch git-for-windows:master to all the repos they all grow
by the amount git-for-windows has diverged:

    144M    /tmp/git-master.git
    36M     /tmp/git-pu-alt.git
    36M     /tmp/git-pu.git
    214M    total

Note that the "sort -r" is critical here. If we fetched git-master.git
first (at this point the alternate for git-pu*.git) we wouldn't get the
duplication in the first place, but instead:

    144M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.1M    /tmp/git-pu.git
    148M    total

This shows the importance of keeping such an 'alternate' repo
up-to-date, i.e. we don't get the duplication in the first place, but
regardless (this from a run with sort -r) a "git gc" will coalesce them:

    131M    /tmp/git-master.git
    2.1M    /tmp/git-pu-alt.git
    2.2M    /tmp/git-pu.git
    135M    total

If you find this interesting make sure to read my
https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
caveats, i.e. if this is something intended for users then no ref in the
alternate can ever be rewound, that'll potentially result in repository
corruption.

1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 14:59 How de-duplicate similar repositories with alternates Ævar Arnfjörð Bjarmason
@ 2018-11-29 16:09 ` Ævar Arnfjörð Bjarmason
  2018-11-29 18:55 ` Stefan Beller
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-29 16:09 UTC (permalink / raw)
  To: git, Git for human beings; +Cc: Christian Couder, Duy Nguyen


On Thu, Nov 29 2018, Ævar Arnfjörð Bjarmason wrote:

> A co-worker asked me today how space could be saved when you have
> multiple checkouts of the same repository (at different revs) on the
> same machine. I said since these won't block-level de-duplicate well[1]
> one way to do this is with alternates.
>
> However, once you have an existing clone I didn't know how to get the
> gains without a full re-clone, but I hadn't looked deeply into it. As it
> turns out I'm wrong about that, which I found when writing the following
> test-case which shows that it works:
>
>     (
>         cd /tmp &&
>         rm -rf /tmp/git-{master,pu,pu-alt}.git &&
>
>         # Normal clones
>         git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
>         git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&
>
>         # An 'alternate' clone using 'master' objects from another repo
>         git --bare init /tmp/git-pu-alt.git &&
>         for git in git-pu.git git-pu-alt.git
>         do
>             echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
>         done &&
>         git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu
>
>         # Respective sizes, 'alternate' clone much smaller
>         du -shc /tmp/git-*.git &&
>
>         # GC them all. Compacts the git-pu.git to git-pu-alt.git's size
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git
>
>         # Add another big history (GFW) to git-{pu,master}.git (in that order!)
>         for repo in $(ls -d /tmp/git-*.git | sort -r)
>         do
>             git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
>         done &&
>         du -shc /tmp/git-*.git &&
>
>         # Another GC. The objects now in git-master.git will be de-duped by all
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git
>     )
>
> This shows a scenario where we clone git.git at "master" and "pu" in
> different places. After clone the relevant sizes are:
>
>     108M    /tmp/git-master.git
>     3.2M    /tmp/git-pu-alt.git
>     109M    /tmp/git-pu.git
>     219M    total
>
> I.e. git-pu-alt.git is much smaller since it points via alternates to
> git-master.git, and the history of "pu" shares most of the objects with
> "master". But then how do you get those gains for git-pu.git? Turns out
> you just "git gc"
>
>     111M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     115M    total
>
> This is the thing I was wrong about, in retrospect probably because I'd
> been putting PATH_TO_REPO in objects/info/alternates, but we actually
> need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
> fsck"). Probably a good idea to patch that at some point, i.e. whine
> about paths in alternates that don't have objects, or at the very least
> those that don't exist. #leftoverbits

Actually looking at this again the thing that may have stumped me last
time is that this has a bad interaction with gc.bigPackThreshold. If you
have an alternate that would otherwise house most of your objects *and*
you have a pack that's larger than the gc.bigPackThreshold your mostly
redundant pack won't be removed.

That's understandable in terms of implementation, but unfortunate. It
would be nice if we learned some way to detect this, i.e. "I have this
10GB pack, but with this alternate I can extract this 100MB out of it
and throw it away". Now we just keep the 10GB pack even if it's mostly
redundant to what's in the alternate.

> Then when we fetch git-for-windows:master to all the repos they all grow
> by the amount git-for-windows has diverged:
>
>     144M    /tmp/git-master.git
>     36M     /tmp/git-pu-alt.git
>     36M     /tmp/git-pu.git
>     214M    total
>
> Note that the "sort -r" is critical here. If we fetched git-master.git
> first (at this point the alternate for git-pu*.git) we wouldn't get the
> duplication in the first place, but instead:
>
>     144M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     148M    total
>
> This shows the importance of keeping such an 'alternate' repo
> up-to-date, i.e. we don't get the duplication in the first place, but
> regardless (this from a run with sort -r) a "git gc" will coalesce them:
>
>     131M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.2M    /tmp/git-pu.git
>     135M    total
>
> If you find this interesting make sure to read my
> https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
> caveats, i.e. if this is something intended for users then no ref in the
> alternate can ever be rewound, that'll potentially result in repository
> corruption.
>
> 1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 14:59 How de-duplicate similar repositories with alternates Ævar Arnfjörð Bjarmason
  2018-11-29 16:09 ` Ævar Arnfjörð Bjarmason
@ 2018-11-29 18:55 ` Stefan Beller
  2018-11-29 20:10   ` Ævar Arnfjörð Bjarmason
  2018-12-04  7:06   ` Jeff King
  2018-12-04  6:59 ` Jeff King
  2018-12-04 13:35 ` Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 20+ messages in thread
From: Stefan Beller @ 2018-11-29 18:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, git-users, Christian Couder

On Thu, Nov 29, 2018 at 7:00 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> A co-worker asked me today how space could be saved when you have
> multiple checkouts of the same repository (at different revs) on the
> same machine. I said since these won't block-level de-duplicate well[1]
> one way to do this is with alternates.

Another way is to use git-worktree, which would solve the gc issues
mentioned below?

I view alternates as a historic artefact as the deduping
of objects client side can be done using worktrees, and on the
serverside - I think - most of the git hosters use namespaces
and put a fork network into the same repository and use pack islands.

Can you elaborate on why worktrees would not solve the problem?
(I initially was hesitant to use them as I liked going into .git and tempering
with files such as the config directly. But now I cannot `cd .git` any more;
it turns out the advantages outweigh this corner case that I was attached to)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 18:55 ` Stefan Beller
@ 2018-11-29 20:10   ` Ævar Arnfjörð Bjarmason
  2018-11-29 20:43     ` Duy Nguyen
  2018-12-04  7:06   ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-29 20:10 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, git-users, Christian Couder


On Thu, Nov 29 2018, Stefan Beller wrote:

> On Thu, Nov 29, 2018 at 7:00 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> A co-worker asked me today how space could be saved when you have
>> multiple checkouts of the same repository (at different revs) on the
>> same machine. I said since these won't block-level de-duplicate well[1]
>> one way to do this is with alternates.
>
> Another way is to use git-worktree, which would solve the gc issues
> mentioned below?
>
> I view alternates as a historic artefact as the deduping
> of objects client side can be done using worktrees, and on the
> serverside - I think - most of the git hosters use namespaces
> and put a fork network into the same repository and use pack islands.
>
> Can you elaborate on why worktrees would not solve the problem?
> (I initially was hesitant to use them as I liked going into .git and tempering
> with files such as the config directly. But now I cannot `cd .git` any more;
> it turns out the advantages outweigh this corner case that I was attached to)

This was discussed recently on-list & I chimed in with details about
that here:
https://public-inbox.org/git/87po1waqyc.fsf@evledraar.gmail.com/ &
https://public-inbox.org/git/87muwzc2kv.fsf@evledraar.gmail.com/

In particular the "multiple devs" use-case described in the latter
E-Mail is what I have in mind. Worktrees are inherently unsuitable for
that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 20:10   ` Ævar Arnfjörð Bjarmason
@ 2018-11-29 20:43     ` Duy Nguyen
  0 siblings, 0 replies; 20+ messages in thread
From: Duy Nguyen @ 2018-11-29 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Stefan Beller, Git Mailing List, git-users, Christian Couder

On Thu, Nov 29, 2018 at 9:15 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Nov 29 2018, Stefan Beller wrote:
>
> > On Thu, Nov 29, 2018 at 7:00 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> A co-worker asked me today how space could be saved when you have
> >> multiple checkouts of the same repository (at different revs) on the
> >> same machine. I said since these won't block-level de-duplicate well[1]
> >> one way to do this is with alternates.
> >
> > Another way is to use git-worktree, which would solve the gc issues
> > mentioned below?
> >
> > I view alternates as a historic artefact as the deduping
> > of objects client side can be done using worktrees, and on the
> > serverside - I think - most of the git hosters use namespaces
> > and put a fork network into the same repository and use pack islands.
> >
> > Can you elaborate on why worktrees would not solve the problem?
> > (I initially was hesitant to use them as I liked going into .git and tempering
> > with files such as the config directly. But now I cannot `cd .git` any more;
> > it turns out the advantages outweigh this corner case that I was attached to)
>
> This was discussed recently on-list & I chimed in with details about
> that here:
> https://public-inbox.org/git/87po1waqyc.fsf@evledraar.gmail.com/ &
> https://public-inbox.org/git/87muwzc2kv.fsf@evledraar.gmail.com/
>
> In particular the "multiple devs" use-case described in the latter
> E-Mail is what I have in mind. Worktrees are inherently unsuitable for
> that.

Yeah, the separate ref namespace is something I would like to have on
the client side too. I did consider implementing it a couple times but
it's really no small task. Naively, I could achieve that pretty quick
if all refs are loose, but performance would tank once the number of
refs goes over say a hundred.
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 14:59 How de-duplicate similar repositories with alternates Ævar Arnfjörð Bjarmason
  2018-11-29 16:09 ` Ævar Arnfjörð Bjarmason
  2018-11-29 18:55 ` Stefan Beller
@ 2018-12-04  6:59 ` Jeff King
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:35 ` Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2018-12-04  6:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Git for human beings, Christian Couder

On Thu, Nov 29, 2018 at 03:59:26PM +0100, Ævar Arnfjörð Bjarmason wrote:

> This is the thing I was wrong about, in retrospect probably because I'd
> been putting PATH_TO_REPO in objects/info/alternates, but we actually
> need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
> fsck"). Probably a good idea to patch that at some point, i.e. whine
> about paths in alternates that don't have objects, or at the very least
> those that don't exist. #leftoverbits

We do complain about missing directories; see alt_odb_usable().
Pointing to a real directory that doesn't happen to contain any objects
is harder. If there are no loose objects, there might not be any hashed
object directories. For a "real" object database, there should always be
a "pack/" directory. But technically the object storage directory does
not even need to have that; it can just be a directory full of loose
objects that happens not to have any at this moment.

That said, I suspect if we issued a warning for "woah, it looks like
this doesn't have any objects in it, nor does it even have a pack
directory" that nobody would complain.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 18:55 ` Stefan Beller
  2018-11-29 20:10   ` Ævar Arnfjörð Bjarmason
@ 2018-12-04  7:06   ` Jeff King
  2018-12-04 12:07     ` Derrick Stolee
  1 sibling, 1 reply; 20+ messages in thread
From: Jeff King @ 2018-12-04  7:06 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Ævar Arnfjörð Bjarmason, git, git-users, Christian Couder

On Thu, Nov 29, 2018 at 10:55:49AM -0800, Stefan Beller wrote:

> On Thu, Nov 29, 2018 at 7:00 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> > A co-worker asked me today how space could be saved when you have
> > multiple checkouts of the same repository (at different revs) on the
> > same machine. I said since these won't block-level de-duplicate well[1]
> > one way to do this is with alternates.
> 
> Another way is to use git-worktree, which would solve the gc issues
> mentioned below?
> 
> I view alternates as a historic artefact as the deduping
> of objects client side can be done using worktrees, and on the
> serverside - I think - most of the git hosters use namespaces
> and put a fork network into the same repository and use pack islands.

Nope, we definitely use alternates. The ref namespace support in Git is
not nearly complete enough to run a modern hosting site; it only kicks
in for upload-pack and receive-pack. Other commands (e.g., rev-list to
traverse for a history-view page) have no support at all. So we share
object storage, but not ref storage.

In theory the caller could namespace requests (e.g., the user asks for
"foo", the web site feeds "refs/forks/$id/refs/heads/foo" to git). But
any bugs are a lot more likely to lead to security problems (oops, you
accidentally wrote into somebody else's fork!). And ref storage has
traditionally been a sore point for scaling, so giving each fork its own
repo and refs helps break that up.

By contrast, object storage is pretty easy to share. It scales
reasonably well, and the security model is much simpler due to the
immutable nature of object names.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-12-04  6:59 ` Jeff King
@ 2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 0/3] sha1-file: warn if alternate is a git repo (not object dir) Ævar Arnfjörð Bjarmason
                       ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 10:43 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Git for human beings, Christian Couder


On Tue, Dec 04 2018, Jeff King wrote:

> On Thu, Nov 29, 2018 at 03:59:26PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This is the thing I was wrong about, in retrospect probably because I'd
>> been putting PATH_TO_REPO in objects/info/alternates, but we actually
>> need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
>> fsck"). Probably a good idea to patch that at some point, i.e. whine
>> about paths in alternates that don't have objects, or at the very least
>> those that don't exist. #leftoverbits
>
> We do complain about missing directories; see alt_odb_usable().
> Pointing to a real directory that doesn't happen to contain any objects
> is harder. If there are no loose objects, there might not be any hashed
> object directories. For a "real" object database, there should always be
> a "pack/" directory. But technically the object storage directory does
> not even need to have that; it can just be a directory full of loose
> objects that happens not to have any at this moment.
>
> That said, I suspect if we issued a warning for "woah, it looks like
> this doesn't have any objects in it, nor does it even have a pack
> directory" that nobody would complain.

Yeah, although see my <87sgzjyif2.fsf@evledraar.gmail.com>, I also ran
into a different issue.

I think a warning (or even error) like this would be more useful:

    test ! -d $objdir && error... # current behavior
    test -d $objdir/objects && error "Did you mean $objdir/objects, silly?" # new error

I.e. I suspect I'm not the only one who's not read the documentation
carefully enough and thought it was a path to the root of the repo and
wondered why it silently didn't work.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-12-04  7:06   ` Jeff King
@ 2018-12-04 12:07     ` Derrick Stolee
  0 siblings, 0 replies; 20+ messages in thread
From: Derrick Stolee @ 2018-12-04 12:07 UTC (permalink / raw)
  To: Jeff King, Stefan Beller
  Cc: Ævar Arnfjörð Bjarmason, git, git-users, Christian Couder

On 12/4/2018 2:06 AM, Jeff King wrote:
> On Thu, Nov 29, 2018 at 10:55:49AM -0800, Stefan Beller wrote:
>
>> I view alternates as a historic artefact as the deduping
>> of objects client side can be done using worktrees, and on the
>> serverside - I think - most of the git hosters use namespaces
>> and put a fork network into the same repository and use pack islands.
> By contrast, object storage is pretty easy to share. It scales
> reasonably well, and the security model is much simpler due to the
> immutable nature of object names.

And for the client side, we use alternates as an important way to scale 
VFS for Git to multiple enlistments on the same machine. VFS for Git 
manages a "shared object cache" (the alternate) that is updated in the 
background (including multi-pack-index and commit-graph).

Using worktrees for the same effect would add complications to the user 
interactions, not only when creating an enlistment but the fact that two 
enlistments cannot check out the same ref will confuse users.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 0/3] sha1-file: warn if alternate is a git repo (not object dir)
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
@ 2018-12-04 13:27     ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 1/3] sha1-file: test the error behavior of alt_odb_usable() Ævar Arnfjörð Bjarmason
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 13:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Derrick Stolee, Christian Couder,
	Ævar Arnfjörð Bjarmason

This adds a warning for the issue discussed upthread. As noted in
these patches we've been emitting an "error" while not impacting the
exit code, should we die() instead? Maybe, but until there's consensus
on that let's change this to warning() while we're at it.

Ævar Arnfjörð Bjarmason (3):
  sha1-file: test the error behavior of alt_odb_usable()
  sha1-file: emit error if an alternate looks like a repository
  sha1-file: change alternate "error:" message to "warning:"

 sha1-file.c               | 16 ++++++++++++----
 t/t5613-info-alternate.sh | 21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 4 deletions(-)

-- 
2.20.0.rc2.403.gdbc3b29805


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] sha1-file: test the error behavior of alt_odb_usable()
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 0/3] sha1-file: warn if alternate is a git repo (not object dir) Ævar Arnfjörð Bjarmason
@ 2018-12-04 13:27     ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository Ævar Arnfjörð Bjarmason
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 13:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Derrick Stolee, Christian Couder,
	Ævar Arnfjörð Bjarmason

Add a test for the error() case in alt_odb_usable() where an alternate
directory doesn't exist. This behavior has been the same since
26125f6b9b ("detect broken alternates.", 2006-02-22), but if that
error() was turned into die() the entire test suite would still pass.

Perhaps we should die() in that case, but let's start by adding a test
here to assert the long-standing existing behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5613-info-alternate.sh | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/t/t5613-info-alternate.sh b/t/t5613-info-alternate.sh
index 895f46bb91..d2964c57b7 100755
--- a/t/t5613-info-alternate.sh
+++ b/t/t5613-info-alternate.sh
@@ -136,4 +136,11 @@ test_expect_success CASE_INSENSITIVE_FS 'dup finding can be case-insensitive' '
 	test_cmp expect actual.alternates
 '
 
+test_expect_success 'print "error" on non-existing alternate' '
+	git init --bare I &&
+	echo DOES_NOT_EXIST >I/objects/info/alternates &&
+	git -C I fsck 2>stderr &&
+	test_i18ngrep "does not exist; check" stderr
+'
+
 test_done
-- 
2.20.0.rc2.403.gdbc3b29805


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 0/3] sha1-file: warn if alternate is a git repo (not object dir) Ævar Arnfjörð Bjarmason
  2018-12-04 13:27     ` [PATCH 1/3] sha1-file: test the error behavior of alt_odb_usable() Ævar Arnfjörð Bjarmason
@ 2018-12-04 13:27     ` Ævar Arnfjörð Bjarmason
  2018-12-05  3:35       ` Junio C Hamano
  2018-12-04 13:27     ` [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:" Ævar Arnfjörð Bjarmason
  2018-12-05  3:30     ` How de-duplicate similar repositories with alternates Junio C Hamano
  4 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 13:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Derrick Stolee, Christian Couder,
	Ævar Arnfjörð Bjarmason

Since 26125f6b9b ("detect broken alternates.", 2006-02-22) we've
emitted an error if the alternates directory doesn't exist, but not
for the common misstep of adding a path to another git repository as
an alternate, as opposed to its "objects" directory.

Let's check for this, i.e. whether X/objects or X/.git/objects exists
if the user supplies X and print an error (which as a commit leading
up to this one shows doesn't change the exit code, just "warns").

This check is intentionally not implemented by e.g. requiring that any
of X/?? exists or X/info or X/pack exists. It's a legitimate use-case
to point to an existing alternate that hasn't been populated yet, but
pointing to one where an "X/objects" or "X/.git/objects" directory
exists is definitely a mistake we should warn the user about.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 sha1-file.c               | 10 +++++++++-
 t/t5613-info-alternate.sh | 14 ++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/sha1-file.c b/sha1-file.c
index 5bd11c85bc..f142f81658 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -376,12 +376,20 @@ static int alt_odb_usable(struct raw_object_store *o,
 {
 	struct alternate_object_database *alt;
 
-	/* Detect cases where alternate disappeared */
 	if (!is_directory(path->buf)) {
+		/* Detect cases where alternate disappeared */
 		error(_("object directory %s does not exist; "
 			"check .git/objects/info/alternates"),
 		      path->buf);
 		return 0;
+	} else if (is_directory(mkpath("%s/objects", path->buf)) ||
+		   is_directory(mkpath("%s/.git/objects", path->buf))) {
+		/* Detect cases where alternate is a git repository */
+		error(_("object directory %s looks like a git repository; "
+			"alternates must point to the 'objects' directory. "
+			"check .git/objects/info/alternates"),
+		      path->buf);
+		return 0;
 	}
 
 	/*
diff --git a/t/t5613-info-alternate.sh b/t/t5613-info-alternate.sh
index d2964c57b7..b959e21421 100755
--- a/t/t5613-info-alternate.sh
+++ b/t/t5613-info-alternate.sh
@@ -143,4 +143,18 @@ test_expect_success 'print "error" on non-existing alternate' '
 	test_i18ngrep "does not exist; check" stderr
 '
 
+test_expect_success 'print "error" on alternate that looks like a git repository' '
+	git init --bare J &&
+	git init --bare K &&
+
+	# H is bare, G is not
+	echo ../../H >J/objects/info/alternates &&
+	echo ../../G >K/objects/info/alternates &&
+
+	git -C J fsck 2>stderr &&
+	test_i18ngrep "looks like a git repository; alternates must" stderr &&
+	git -C K fsck 2>stderr &&
+	test_i18ngrep "looks like a git repository; alternates must" stderr
+'
+
 test_done
-- 
2.20.0.rc2.403.gdbc3b29805


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:"
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
                       ` (2 preceding siblings ...)
  2018-12-04 13:27     ` [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository Ævar Arnfjörð Bjarmason
@ 2018-12-04 13:27     ` Ævar Arnfjörð Bjarmason
  2018-12-05  3:37       ` Junio C Hamano
  2018-12-05  3:30     ` How de-duplicate similar repositories with alternates Junio C Hamano
  4 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 13:27 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Derrick Stolee, Christian Couder,
	Ævar Arnfjörð Bjarmason

Change the "error" message emitted by alt_odb_usable() to be a
"warning" instead. As noted in commits leading up to this one this has
never impacted the exit code ever since the check was initially added
in 26125f6b9b ("detect broken alternates.", 2006-02-22).

It's confusing to emit an "error" when e.g. "git fsck" will exit with
0, so let's emit a "warning:" instead.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 sha1-file.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/sha1-file.c b/sha1-file.c
index f142f81658..4b9b63bdcb 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -378,17 +378,17 @@ static int alt_odb_usable(struct raw_object_store *o,
 
 	if (!is_directory(path->buf)) {
 		/* Detect cases where alternate disappeared */
-		error(_("object directory %s does not exist; "
-			"check .git/objects/info/alternates"),
-		      path->buf);
+		warning(_("object directory %s does not exist; "
+			  "check .git/objects/info/alternates"),
+			path->buf);
 		return 0;
 	} else if (is_directory(mkpath("%s/objects", path->buf)) ||
 		   is_directory(mkpath("%s/.git/objects", path->buf))) {
 		/* Detect cases where alternate is a git repository */
-		error(_("object directory %s looks like a git repository; "
-			"alternates must point to the 'objects' directory. "
-			"check .git/objects/info/alternates"),
-		      path->buf);
+		warning(_("object directory %s looks like a git repository; "
+			  "alternates must point to the 'objects' directory. "
+			  "check .git/objects/info/alternates"),
+			path->buf);
 		return 0;
 	}
 
-- 
2.20.0.rc2.403.gdbc3b29805


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-11-29 14:59 How de-duplicate similar repositories with alternates Ævar Arnfjörð Bjarmason
                   ` (2 preceding siblings ...)
  2018-12-04  6:59 ` Jeff King
@ 2018-12-04 13:35 ` Ævar Arnfjörð Bjarmason
  2018-12-04 14:17   ` Derrick Stolee
  3 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-04 13:35 UTC (permalink / raw)
  To: git, Git for human beings; +Cc: Christian Couder, Derrick Stolee


On Thu, Nov 29 2018, Ævar Arnfjörð Bjarmason wrote:

> A co-worker asked me today how space could be saved when you have
> multiple checkouts of the same repository (at different revs) on the
> same machine. I said since these won't block-level de-duplicate well[1]
> one way to do this is with alternates.
>
> However, once you have an existing clone I didn't know how to get the
> gains without a full re-clone, but I hadn't looked deeply into it. As it
> turns out I'm wrong about that, which I found when writing the following
> test-case which shows that it works:
>
>     (
>         cd /tmp &&
>         rm -rf /tmp/git-{master,pu,pu-alt}.git &&
>
>         # Normal clones
>         git clone --bare --no-tags --single-branch --branch master https://github.com/git/git.git /tmp/git-master.git &&
>         git clone --bare --no-tags --single-branch --branch pu https://github.com/git/git.git /tmp/git-pu.git &&
>
>         # An 'alternate' clone using 'master' objects from another repo
>         git --bare init /tmp/git-pu-alt.git &&
>         for git in git-pu.git git-pu-alt.git
>         do
>             echo /tmp/git-master.git/objects >/tmp/$git/objects/info/alternates
>         done &&
>         git -C git-pu-alt.git fetch --no-tags https://github.com/git/git.git pu:pu
>
>         # Respective sizes, 'alternate' clone much smaller
>         du -shc /tmp/git-*.git &&
>
>         # GC them all. Compacts the git-pu.git to git-pu-alt.git's size
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git
>
>         # Add another big history (GFW) to git-{pu,master}.git (in that order!)
>         for repo in $(ls -d /tmp/git-*.git | sort -r)
>         do
>             git -C $repo fetch --no-tags https://github.com/git-for-windows/git master:master-gfw
>         done &&
>         du -shc /tmp/git-*.git &&
>
>         # Another GC. The objects now in git-master.git will be de-duped by all
>         for repo in git-*.git
>         do
>             git -C $repo gc
>         done &&
>         du -shc /tmp/git-*.git
>     )
>
> This shows a scenario where we clone git.git at "master" and "pu" in
> different places. After clone the relevant sizes are:
>
>     108M    /tmp/git-master.git
>     3.2M    /tmp/git-pu-alt.git
>     109M    /tmp/git-pu.git
>     219M    total
>
> I.e. git-pu-alt.git is much smaller since it points via alternates to
> git-master.git, and the history of "pu" shares most of the objects with
> "master". But then how do you get those gains for git-pu.git? Turns out
> you just "git gc"
>
>     111M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     115M    total
>
> This is the thing I was wrong about, in retrospect probably because I'd
> been putting PATH_TO_REPO in objects/info/alternates, but we actually
> need PATH_TO_REPO/objects, and "git gc" won't warn about this (or "git
> fsck"). Probably a good idea to patch that at some point, i.e. whine
> about paths in alternates that don't have objects, or at the very least
> those that don't exist. #leftoverbits
>
> Then when we fetch git-for-windows:master to all the repos they all grow
> by the amount git-for-windows has diverged:
>
>     144M    /tmp/git-master.git
>     36M     /tmp/git-pu-alt.git
>     36M     /tmp/git-pu.git
>     214M    total
>
> Note that the "sort -r" is critical here. If we fetched git-master.git
> first (at this point the alternate for git-pu*.git) we wouldn't get the
> duplication in the first place, but instead:
>
>     144M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.1M    /tmp/git-pu.git
>     148M    total
>
> This shows the importance of keeping such an 'alternate' repo
> up-to-date, i.e. we don't get the duplication in the first place, but
> regardless (this from a run with sort -r) a "git gc" will coalesce them:
>
>     131M    /tmp/git-master.git
>     2.1M    /tmp/git-pu-alt.git
>     2.2M    /tmp/git-pu.git
>     135M    total
>
> If you find this interesting make sure to read my
> https://public-inbox.org/git/87k1s3bomt.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87in7nbi5b.fsf@evledraar.gmail.com/ for the
> caveats, i.e. if this is something intended for users then no ref in the
> alternate can ever be rewound, that'll potentially result in repository
> corruption.
>
> 1. https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/

Maybe this is useful to someone. Here's a cronjob I wrote since I wrote
this thread that runs in daily cron on some of our systems.

It expects repositories in /var/lib/git_tree-for-alternates like
/var/lib/git_tree-for-alternates/git/git.git to exist, then scours /home
and /etc/puppet/environments (which we had a lot of) for "config" files
with the string in git/git (this saves us some work) and then tries to
find a git repository relative to that "config" file with "rev-parse
--absolute-git-dir".

If there is one, we check if the repository has a SHA-1 that the history
of our /var/lib/git_tree-for-alternates/git/git.git started with (if >1
we pick the oldest), if so this is a repository that can benefit from
using /var/lib/git_tree-for-alternates/git/git.git/objects as an
alternate, and we add the appropriate alternate info, unset
gc.bigPackThreshold so GC will actually do its work, and run "git gc"
sudo'd as the the user who owns the thing.

One one server the .git directories in /home went from ~2TB to ~100GB
using this script. On another from ~250G to ~5G. The leftover space
spent is the commit-grah (not de-duped like objects are), and whatever
accumulated divergence (topic branches mainly) exist in those repos
different than what the alternate store has in the HEAD branch.

#!/bin/bash

set -euo pipefail

ALTERNATES_STORE=/var/lib/git_tree-for-alternates

if ! test -d $ALTERNATES_STORE
then
    echo 'We have no alternates repositories here to point to!' >&2
    exit 0
fi


find_owning_user() {
    path=$1
    case $path in
        /home/*|/etc/puppet/environments/*)
            who=$(echo $path | perl -pe 's[^
                (?:
                    /home
                    |
                    /etc/puppet/environments
                )
                /
                ([^/]+)
                /
                .*
            ][$1]gx')
            if getent passwd $who >/dev/null
            then
                echo $who
            else
                echo "Know how to get user from path '$path', but '$who' is not a valid user!" >&2
            fi
            ;;
        *)
            echo "Don't know how to get user from path '$path' yet!" >&2
            ;;
    esac
}

find $ALTERNATES_STORE -type d -name '*.git' -printf "%P\n" |
while read alternate
do
    alternate_no_git=$(echo $alternate | sed 's/\.git//')
    ALTERNATES_STORE_OBJECTS=$ALTERNATES_STORE/$alternate/objects

    # If these repositories we're finding don't share a root commit
    # with the repo we have this is not going to work and we have the
    # wrong match. Note that we can have more than one root commit
    # and try to find the oldest one. Pretty sure bet that that's
    # the "real" root.
    root_commit=$(git -C $ALTERNATES_STORE/$alternate log --max-parents=0 --date-order --reverse --pretty=format:%H | head -n 1)
    echo "> Finding repositories on the system that share the $root_commit commit with $alternate" >&2

    find \
        /home \
        $(if test -d /etc/puppet/environments; then echo /etc/puppet/environments; fi) \
        -type f -name 'config' -exec grep -Hl $alternate_no_git {} \; 2>/dev/null |
    while read config
    do
        dirname=$(dirname $config)
        echo ">> Checking if $dirname is in a $alternate git repository..." >&2
        if git_dir=$(git -C $dirname rev-parse --absolute-git-dir) &&
                git -C $git_dir cat-file -e $root_commit
        then
            echo ">>> ...Yes it was, at $git_dir" >&2
            echo ">>>> Is it already migrated?..." >&2
            if test -e $git_dir/objects/info/alternates &&
                    grep -x -F -q $ALTERNATES_STORE_OBJECTS $git_dir/objects/info/alternates
            then
                echo ">>>> ...yes, nothing to do here" >&2
                continue
            else
                echo ">>>> ...no, doing migration" >&2

                who=$(find_owning_user $git_dir)
                if test -z "$who"
                then
                    echo ">>>>> unable to find who owns $git_dir" >&2
                    continue
                else
                    echo ">>>>> found that $who owns $git_dir" >&2
                fi

                if test "$DRY_RUN" = "1"
                then
                    echo ">>>>>> Would have ran commands migrating $git_dir"
                else
                    if ! sudo -u $who stat $git_dir >/dev/null 2>&1
                    then
                        echo ">>>>>> The '$who' user can't access his own '$git_dir'. Could be e.g. ex-employee. Using 'root'"
                        who=root
                    fi

                    echo ">>>>>> Migrating $git_dir is now $(sudo -u $who du -sh $git_dir | cut -f1)"
                    sudo -u $who git -C $git_dir config gc.bigPackThreshold 0
                    echo $ALTERNATES_STORE_OBJECTS | sudo tee -a $git_dir/objects/info/alternates >/dev/null
                    sudo -u $who git -C $git_dir gc
                    echo ">>>>>> Migrated $git_dir is now $(sudo -u $who du -sh $git_dir | cut -f1)"
                fi
            fi
        else
            echo ">>> No it isn't. Skipping it" >&2
            continue
        fi
    done
done

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-12-04 13:35 ` Ævar Arnfjörð Bjarmason
@ 2018-12-04 14:17   ` Derrick Stolee
  0 siblings, 0 replies; 20+ messages in thread
From: Derrick Stolee @ 2018-12-04 14:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git, Git for human beings
  Cc: Christian Couder

On 12/4/2018 8:35 AM, Ævar Arnfjörð Bjarmason wrote:
>   The leftover space
> spent is the commit-grah (not de-duped like objects are), and...

The commit-graph could be shared, as the commits in each enlistment can 
be parsed from local with GENERATION_NUMBER_INFINITY, giving us similar 
speedups.

The issue is: who updates the file? As the commit-graph gets behind, 
performance will degrade. But it seems like you'd need similar 
maintenance on the alternate object store, anyway.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: How de-duplicate similar repositories with alternates
  2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
                       ` (3 preceding siblings ...)
  2018-12-04 13:27     ` [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:" Ævar Arnfjörð Bjarmason
@ 2018-12-05  3:30     ` Junio C Hamano
  4 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2018-12-05  3:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, git, Git for human beings, Christian Couder

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I think a warning (or even error) like this would be more useful:
>
>     test ! -d $objdir && error... # current behavior
>     test -d $objdir/objects && error "Did you mean $objdir/objects, silly?" # new error

If it is an error common enough, perhaps we could even DWIM it, I
guess, that is...

	if test ! -d $objdir
	then
		error
	elif test -d $objdir/objects/pack
	then
		possibly warn
		objdir=$objdir/objects
	fi

> I.e. I suspect I'm not the only one who's not read the documentation
> carefully enough and thought it was a path to the root of the repo and
> wondered why it silently didn't work.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository
  2018-12-04 13:27     ` [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository Ævar Arnfjörð Bjarmason
@ 2018-12-05  3:35       ` Junio C Hamano
  2018-12-05  6:10         ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2018-12-05  3:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Jeff King, Derrick Stolee, Christian Couder

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Since 26125f6b9b ("detect broken alternates.", 2006-02-22) we've
> emitted an error if the alternates directory doesn't exist, but not
> for the common misstep of adding a path to another git repository as
> an alternate, as opposed to its "objects" directory.
>
> Let's check for this, i.e. whether X/objects or X/.git/objects exists
> if the user supplies X and print an error (which as a commit leading
> up to this one shows doesn't change the exit code, just "warns").

I agree that "Let's check for this" is a good idea, but do not
necessarily agree with "i.e.".  Don't we have a helper that takes
the path to an existing directory and answers "Yup, it does look
like a Git repository"?  Using that is a lot more in line with what
you claimed to do in the title for this patch.

I haven't read 3/3 yet, but as I said, I suspect it is reasonable to
DWIM and use the object store associated with the directory we found
to be a repository.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:"
  2018-12-04 13:27     ` [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:" Ævar Arnfjörð Bjarmason
@ 2018-12-05  3:37       ` Junio C Hamano
  2018-12-05  5:54         ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2018-12-05  3:37 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Jeff King, Derrick Stolee, Christian Couder

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Change the "error" message emitted by alt_odb_usable() to be a
> "warning" instead. As noted in commits leading up to this one this has
> never impacted the exit code ever since the check was initially added
> in 26125f6b9b ("detect broken alternates.", 2006-02-22).
>
> It's confusing to emit an "error" when e.g. "git fsck" will exit with
> 0, so let's emit a "warning:" instead.

OK, that sounds sensible.  An alternative that may be more sensible
is to actually diagnose this as an error, but the purpose of this
message is to help users diagnose a possible misconfiguration and
keeping the repository "working" with the remaining set of object
stores by leaving it as a mere warning, like this patch does, is
probably a better approach.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:"
  2018-12-05  3:37       ` Junio C Hamano
@ 2018-12-05  5:54         ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2018-12-05  5:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Derrick Stolee,
	Christian Couder

On Wed, Dec 05, 2018 at 12:37:58PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> 
> > Change the "error" message emitted by alt_odb_usable() to be a
> > "warning" instead. As noted in commits leading up to this one this has
> > never impacted the exit code ever since the check was initially added
> > in 26125f6b9b ("detect broken alternates.", 2006-02-22).
> >
> > It's confusing to emit an "error" when e.g. "git fsck" will exit with
> > 0, so let's emit a "warning:" instead.
> 
> OK, that sounds sensible.  An alternative that may be more sensible
> is to actually diagnose this as an error, but the purpose of this
> message is to help users diagnose a possible misconfiguration and
> keeping the repository "working" with the remaining set of object
> stores by leaving it as a mere warning, like this patch does, is
> probably a better approach.

Yeah, I think it's better to keep it as a warning. It's actually
reasonably likely to be benign (e.g., you did a "git repack -ad && rm
/path/to/alternate" to remove the dependency, but forgot to clean up the
alternates). And when it _is_ a problem, the object-reading code paths
will definitely let you know.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository
  2018-12-05  3:35       ` Junio C Hamano
@ 2018-12-05  6:10         ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2018-12-05  6:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Derrick Stolee,
	Christian Couder

On Wed, Dec 05, 2018 at 12:35:08PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> 
> > Since 26125f6b9b ("detect broken alternates.", 2006-02-22) we've
> > emitted an error if the alternates directory doesn't exist, but not
> > for the common misstep of adding a path to another git repository as
> > an alternate, as opposed to its "objects" directory.
> >
> > Let's check for this, i.e. whether X/objects or X/.git/objects exists
> > if the user supplies X and print an error (which as a commit leading
> > up to this one shows doesn't change the exit code, just "warns").
> 
> I agree that "Let's check for this" is a good idea, but do not
> necessarily agree with "i.e.".  Don't we have a helper that takes
> the path to an existing directory and answers "Yup, it does look
> like a Git repository"?  Using that is a lot more in line with what
> you claimed to do in the title for this patch.

Hmm. Yeah, one case this does not handle is when ".git" is a git-file
pointing elsewhere, which should trigger the condition, too.

I think we can afford to be a bit loose with this check if it's just
generating a warning for a case that would otherwise not work (and the
worst is that we might fail to correctly diagnose a broken setup). But
that I think points to another issue: this kicks in even if the path is
otherwise usable.

So if had, say, a git repository whose worktree was full of objects and
packfiles, it currently works for me to point to that as an alternate.
But after this patch, we'd complain "wait, this looks like a git repo!".

So I'd much rather see the logic check first for something usable, and
only when we fail to find it, start doing a loose diagnosis. Something
like:

  if !is_directory($path)
	complain that it does not exist, as now
  else if !is_directory($path/pack)
	/*
	 * it doesn't look like an object directory; technically it
	 * _could_ just have loose objects, and maybe we ought to check
	 * for directories matching [0-9a-f]{2}, though it seems
	 * somewhat unlikely these days.
	 */
	if is_directory($path/objects) || exists($path/.git)
		complain that it looks like a git dir
	else
		complain that it doesn't look like an object dir
  fi

Hmm. I managed to write a gross mix of C and shell for my pseudocode,
but hopefully you can read it. ;)

> I haven't read 3/3 yet, but as I said, I suspect it is reasonable to
> DWIM and use the object store associated with the directory we found
> to be a repository.

Yeah, I'm tempted by that, too, though I worry about corner cases. How
much effort should we put into discovery? I guess the rules from
enter_repo() would make sense, though the logic in that function would
need some refactoring to be reused elsewhere.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-29 14:59 How de-duplicate similar repositories with alternates Ævar Arnfjörð Bjarmason
2018-11-29 16:09 ` Ævar Arnfjörð Bjarmason
2018-11-29 18:55 ` Stefan Beller
2018-11-29 20:10   ` Ævar Arnfjörð Bjarmason
2018-11-29 20:43     ` Duy Nguyen
2018-12-04  7:06   ` Jeff King
2018-12-04 12:07     ` Derrick Stolee
2018-12-04  6:59 ` Jeff King
2018-12-04 10:43   ` Ævar Arnfjörð Bjarmason
2018-12-04 13:27     ` [PATCH 0/3] sha1-file: warn if alternate is a git repo (not object dir) Ævar Arnfjörð Bjarmason
2018-12-04 13:27     ` [PATCH 1/3] sha1-file: test the error behavior of alt_odb_usable() Ævar Arnfjörð Bjarmason
2018-12-04 13:27     ` [PATCH 2/3] sha1-file: emit error if an alternate looks like a repository Ævar Arnfjörð Bjarmason
2018-12-05  3:35       ` Junio C Hamano
2018-12-05  6:10         ` Jeff King
2018-12-04 13:27     ` [PATCH 3/3] sha1-file: change alternate "error:" message to "warning:" Ævar Arnfjörð Bjarmason
2018-12-05  3:37       ` Junio C Hamano
2018-12-05  5:54         ` Jeff King
2018-12-05  3:30     ` How de-duplicate similar repositories with alternates Junio C Hamano
2018-12-04 13:35 ` Ævar Arnfjörð Bjarmason
2018-12-04 14:17   ` Derrick Stolee

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox