git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* icase pathspec magic support in ls-tree
@ 2022-09-30 12:04 Tao Klerks
  2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 16+ messages in thread
From: Tao Klerks @ 2022-09-30 12:04 UTC (permalink / raw)
  To: git

Hi folks,

I just found out today both that icase magic exists (awesome!), and
that it isn't supported in ls-tree (boo).

As far as I can tell, getting it supported would be the only way to
*efficiently* prevent caseless-duplicate files from being created in a
repo, in an "update" hook: I'd want to call ls-tree on the new head
commit for the branch, passing in an icase pathspec for all the files
being added since the previous state - and then sort and uniq.

Of course, for entirely new branches I'd have to do a full check of
the tree, and for very large changes that might be the fastest/best
thing to do anyway, rather than creating a silly-sized pathspec - but
checking the full tree costs me about 1 second, a price that I'm loath
to pay for everyday commit verifications of a handful of files, vs a
200,000-file full tree.

I tried changing ls-tree "naively" to just permit the icase magic,
without any logic changes, and found at least one case where it
doesn't work: when combining wildcards with case-insensitivity, like
an icased "T/*" patchspec in the git repo; ls-files finds all the
tests, and a naively updated ls-tree does not.

I think I see the last person to update this was Nguyễn Thái Ngọc Duy
in 2013, giving a hint as to what would need to be done to make this
be supported; is this an area anyone else might be looking at at the
moment?

Thanks,
Tao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks
@ 2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
  2022-10-02 19:07   ` brian m. carlson
  0 siblings, 1 reply; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-30 13:53 UTC (permalink / raw)
  To: Tao Klerks; +Cc: git


On Fri, Sep 30 2022, Tao Klerks wrote:

> Hi folks,
>
> I just found out today both that icase magic exists (awesome!), and
> that it isn't supported in ls-tree (boo).
>
> As far as I can tell, getting it supported would be the only way to
> *efficiently* prevent caseless-duplicate files from being created in a
> repo, in an "update" hook: I'd want to call ls-tree on the new head
> commit for the branch, passing in an icase pathspec for all the files
> being added since the previous state - and then sort and uniq.
>
> Of course, for entirely new branches I'd have to do a full check of
> the tree, and for very large changes that might be the fastest/best
> thing to do anyway, rather than creating a silly-sized pathspec - but
> checking the full tree costs me about 1 second, a price that I'm loath
> to pay for everyday commit verifications of a handful of files, vs a
> 200,000-file full tree.
>
> I tried changing ls-tree "naively" to just permit the icase magic,
> without any logic changes, and found at least one case where it
> doesn't work: when combining wildcards with case-insensitivity, like
> an icased "T/*" patchspec in the git repo; ls-files finds all the
> tests, and a naively updated ls-tree does not.
>
> I think I see the last person to update this was Nguyễn Thái Ngọc Duy
> in 2013, giving a hint as to what would need to be done to make this
> be supported; is this an area anyone else might be looking at at the
> moment?

You might find ASCII-only sufficient, but note that even if you get this
working you won't catch the more complex Unicode normalization rules
various filesystems perform, see the fsck code we carefully crafted to
make sure we don't get something those FS's will mistake for a ".git"
directory in-tree.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
@ 2022-10-02 19:07   ` brian m. carlson
  2022-10-13  6:35     ` Tao Klerks
  0 siblings, 1 reply; 16+ messages in thread
From: brian m. carlson @ 2022-10-02 19:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Tao Klerks, git

[-- Attachment #1: Type: text/plain, Size: 1187 bytes --]

On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote:
> You might find ASCII-only sufficient, but note that even if you get this
> working you won't catch the more complex Unicode normalization rules
> various filesystems perform, see the fsck code we carefully crafted to
> make sure we don't get something those FS's will mistake for a ".git"
> directory in-tree.

What's even worse is that different OSes case-fold differently and the
behaviour differs based on the version of the OS that formatted the file
system (which is of course not exposed to userspace), so in general it's
impossible to know exactly how case folding works on a particular
system.

It might be possible to implement some general rules that are
overzealous (in that they will catch patterns that will case-fold on
_some_ system), but in general this is very difficult.  The rules will
also almost certainly change with newer versions of Unicode.

I'll also point out that there is no locale-independent way to correctly
case-fold Unicode text.  Correct case-folding is sensitive to the
language, script, and region.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-02 19:07   ` brian m. carlson
@ 2022-10-13  6:35     ` Tao Klerks
  2022-10-14  4:51       ` Torsten Bögershausen
  2022-10-14  7:41       ` Elijah Newren
  0 siblings, 2 replies; 16+ messages in thread
From: Tao Klerks @ 2022-10-13  6:35 UTC (permalink / raw)
  To: brian m. carlson, Ævar Arnfjörð Bjarmason,
	Tao Klerks, git

On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote:
> > You might find ASCII-only sufficient, but note that even if you get this
> > working you won't catch the more complex Unicode normalization rules
> > various filesystems perform, see the fsck code we carefully crafted to
> > make sure we don't get something those FS's will mistake for a ".git"
> > directory in-tree.
>
> What's even worse is that different OSes case-fold differently and the
> behaviour differs based on the version of the OS that formatted the file
> system (which is of course not exposed to userspace), so in general it's
> impossible to know exactly how case folding works on a particular
> system.
>
> It might be possible to implement some general rules that are
> overzealous (in that they will catch patterns that will case-fold on
> _some_ system), but in general this is very difficult.  The rules will
> also almost certainly change with newer versions of Unicode.
>
> I'll also point out that there is no locale-independent way to correctly
> case-fold Unicode text.  Correct case-folding is sensitive to the
> language, script, and region.

Thanks for the feedback!

If I'm understanding correctly, both of these responses were targeted
specifically at my motivation/usecase (preventing the submission of
case-insensitively duplicate files into a repository) rather than the
question of whether anyone has worked or is working on anything
relevant to adding icase pathspec magic support to ls-tree.

I understand that case-folding is a complex topic, and doing it
correctly in some universal sense is undoubtedly beyond me - but "my"
context certainly does not require a high standard of correctness:
There's a repo shared by some 1000 engineers, 200k files, lots of
activity, three different OSes of which two default to
case-insensitive filesystems, and every once in a while a user on
linux creates a case-insensitive-duplicate file with differing
content, which causes git on case-insensitive filesystems to lose the
plot (you end up with one file's content looking like changes to the
other file's content - "ghost changes" that appear as soon as you
check out, that prevent you from doing a simple "pull", and that you
just can't reset).

I don't imagine I can make a perfectly correct and universal fix to
this, but with case-insensitive matching on ls-tree in an update hook
I believe I could reduce the frequency of this already-infrequent
issue by at least 1000X, which would suit my purposes just fine. In my
case filenames are mostly ansi-based, and I don't expect we've ever
had Turkish filenames (turkish "i" being the most famous case-folding
gotcha I think?).

Coming at this from another angle, I guess we could teach git on
case-insensitive filesystems to detect this situation (where two files
in the index, with different contents, are pointing to the exact same
filesystem file) and more explicitly warn the user of what's wrong,
giving them clear help on how to fix it? And temporarily exclude those
two files from its change reconciliation processes altogether to avoid
ghost changes interfering with recovery actions like "pull"? Certainly
that would be better than the current "ghost changes" behavior... but
it would still be far less convenient than preventing (the vast
majority of) these issues altogether, be that with a custom hook or a
core option prohibiting clearly case-insensitive-duplicate files from
being pushed.

By the time a case-insensitive-FS-user encounters this issue in their
repo as they checkout or clone, it's likely that the problem is in
master/main and others are already affected, and both the cycle time
to fixing the issue, and the communication impact in the current state
("please wait, the issue is being addressed, once the remote branch is
fixed here's what you'll do to 'pull' successfully in spite of the
local repo thinking there are filesystem changes that really don't
exist and can't be reset") are... suboptimal.

It feels like adding case-insensitivity pathspec magic support to
ls-tree (however reliable or universal the subsequent
duplicate-detection is or isn't) *should* be much simpler than what it
would have taken to support it in ls-files in the first place - but at
a glance, I see the official pathspec-supporting function
"match_pathspec()" is deep in index-land, with an "index_state"
structure being passed around all over the place. If it really was
easy, someone would already have done it I guess? :)

I don't see this being something I can take on in my spare time, so
for now I suspect I'll have to do a full-tree duplicate-file-search on
every ref update, and simply accept the 1-second update hook
processing time/delay per pushed ref :(

I'm assuming the "ghost changes" behavior I allude to here (where two
different files in the index, with different contents, point to the
same single file in the case-insensitive filesystem, and one or the
other index file appears modified / the working tree looks "dirty") is
a known issue, but if there's any value in my opening a thread more
clearly/explicitly about this behavior, please let me know.

Thanks,
Tao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-13  6:35     ` Tao Klerks
@ 2022-10-14  4:51       ` Torsten Bögershausen
  2022-10-14  8:31         ` Tao Klerks
  2022-10-14  7:41       ` Elijah Newren
  1 sibling, 1 reply; 16+ messages in thread
From: Torsten Bögershausen @ 2022-10-14  4:51 UTC (permalink / raw)
  To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

On Thu, Oct 13, 2022 at 08:35:11AM +0200, Tao Klerks wrote:
> On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> >
> > On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote:
> > > You might find ASCII-only sufficient, but note that even if you get this
> > > working you won't catch the more complex Unicode normalization rules
> > > various filesystems perform, see the fsck code we carefully crafted to
> > > make sure we don't get something those FS's will mistake for a ".git"
> > > directory in-tree.
> >
> > What's even worse is that different OSes case-fold differently and the
> > behaviour differs based on the version of the OS that formatted the file
> > system (which is of course not exposed to userspace), so in general it's
> > impossible to know exactly how case folding works on a particular
> > system.
> >
> > It might be possible to implement some general rules that are
> > overzealous (in that they will catch patterns that will case-fold on
> > _some_ system), but in general this is very difficult.  The rules will
> > also almost certainly change with newer versions of Unicode.
> >
> > I'll also point out that there is no locale-independent way to correctly
> > case-fold Unicode text.  Correct case-folding is sensitive to the
> > language, script, and region.
>
> Thanks for the feedback!
>
> If I'm understanding correctly, both of these responses were targeted
> specifically at my motivation/usecase (preventing the submission of
> case-insensitively duplicate files into a repository) rather than the
> question of whether anyone has worked or is working on anything
> relevant to adding icase pathspec magic support to ls-tree.
>
> I understand that case-folding is a complex topic, and doing it
> correctly in some universal sense is undoubtedly beyond me - but "my"
> context certainly does not require a high standard of correctness:
> There's a repo shared by some 1000 engineers, 200k files, lots of
> activity, three different OSes of which two default to
> case-insensitive filesystems, and every once in a while a user on
> linux creates a case-insensitive-duplicate file with differing
> content, which causes git on case-insensitive filesystems to lose the
> plot (you end up with one file's content looking like changes to the
> other file's content - "ghost changes" that appear as soon as you
> check out, that prevent you from doing a simple "pull", and that you
> just can't reset).
>
> I don't imagine I can make a perfectly correct and universal fix to
> this, but with case-insensitive matching on ls-tree in an update hook
> I believe I could reduce the frequency of this already-infrequent
> issue by at least 1000X, which would suit my purposes just fine. In my
> case filenames are mostly ansi-based, and I don't expect we've ever
> had Turkish filenames (turkish "i" being the most famous case-folding
> gotcha I think?).
>
> Coming at this from another angle, I guess we could teach git on
> case-insensitive filesystems to detect this situation (where two files
> in the index, with different contents, are pointing to the exact same
> filesystem file) and more explicitly warn the user of what's wrong,
> giving them clear help on how to fix it? And temporarily exclude those
> two files from its change reconciliation processes altogether to avoid
> ghost changes interfering with recovery actions like "pull"? Certainly
> that would be better than the current "ghost changes" behavior... but
> it would still be far less convenient than preventing (the vast
> majority of) these issues altogether, be that with a custom hook or a
> core option prohibiting clearly case-insensitive-duplicate files from
> being pushed.
>
> By the time a case-insensitive-FS-user encounters this issue in their
> repo as they checkout or clone, it's likely that the problem is in
> master/main and others are already affected, and both the cycle time
> to fixing the issue, and the communication impact in the current state
> ("please wait, the issue is being addressed, once the remote branch is
> fixed here's what you'll do to 'pull' successfully in spite of the
> local repo thinking there are filesystem changes that really don't
> exist and can't be reset") are... suboptimal.
>
> It feels like adding case-insensitivity pathspec magic support to
> ls-tree (however reliable or universal the subsequent
> duplicate-detection is or isn't) *should* be much simpler than what it
> would have taken to support it in ls-files in the first place - but at
> a glance, I see the official pathspec-supporting function
> "match_pathspec()" is deep in index-land, with an "index_state"
> structure being passed around all over the place. If it really was
> easy, someone would already have done it I guess? :)
>
> I don't see this being something I can take on in my spare time, so
> for now I suspect I'll have to do a full-tree duplicate-file-search on
> every ref update, and simply accept the 1-second update hook
> processing time/delay per pushed ref :(
>
> I'm assuming the "ghost changes" behavior I allude to here (where two
> different files in the index, with different contents, point to the
> same single file in the case-insensitive filesystem, and one or the
> other index file appears modified / the working tree looks "dirty") is
> a known issue, but if there's any value in my opening a thread more
> clearly/explicitly about this behavior, please let me know.
>
> Thanks,
> Tao


Thanks for sharing your experience in detail.

Did you ever consider to write a shell script,
that can detect icase-collisions ?

For example, we can use Linux:
 git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $?
 include/uapi/linux/netfilter_ipv4/ipt_ecn.h
 include/uapi/linux/netfilter_ipv4/ipt_ttl.h
 [snip the other files]

The GNU versions of uniq allow an even shorter command,
(But the POSIX versions don't)

git ls-files  | sort | uniq -i -d

I think that a script like this could do the trick:

#!/bin/sh
ret=1
>/tmp/$$-exp
git ls-files  | sort | uniq -i -d >/tmp/$$-act &&
  cmp /tmp/$$-exp /tmp/$$-act &&
    ret=0
    rm -f /tmp/$$-exp /tmp/$$-act
    exit $ret


####################
The usage of files in /tmp is probably debatable,
I want just illustrate how a combination of shell
scripts in combination with existing commands can be used.

The biggest step may be to introduce a server-side hook
that does a check.
But once that is done and working, you probably do
not want to miss it.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-13  6:35     ` Tao Klerks
  2022-10-14  4:51       ` Torsten Bögershausen
@ 2022-10-14  7:41       ` Elijah Newren
  2022-10-14  8:03         ` Erik Cervin Edin
  2022-10-14  8:48         ` Tao Klerks
  1 sibling, 2 replies; 16+ messages in thread
From: Elijah Newren @ 2022-10-14  7:41 UTC (permalink / raw)
  To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

On Wed, Oct 12, 2022 at 11:49 PM Tao Klerks <tao@klerks.biz> wrote:
>
[...]
> I understand that case-folding is a complex topic, and doing it
> correctly in some universal sense is undoubtedly beyond me - but "my"
> context certainly does not require a high standard of correctness:
> There's a repo shared by some 1000 engineers, 200k files, lots of
> activity, three different OSes of which two default to
> case-insensitive filesystems, and every once in a while a user on
> linux creates a case-insensitive-duplicate file with differing
> content, which causes git on case-insensitive filesystems to lose the
> plot (you end up with one file's content looking like changes to the
> other file's content - "ghost changes" that appear as soon as you
> check out, that prevent you from doing a simple "pull", and that you
> just can't reset).

I've felt that same pain.

> I don't imagine I can make a perfectly correct and universal fix to
> this, but with case-insensitive matching on ls-tree in an update hook
> I believe I could reduce the frequency of this already-infrequent
> issue by at least 1000X, which would suit my purposes just fine. In my
> case filenames are mostly ansi-based, and I don't expect we've ever
> had Turkish filenames (turkish "i" being the most famous case-folding
> gotcha I think?).

How exactly would case-insensitive matching in ls-tree help you here?
Can't you write a hook without such capability that rejects such
collisions?

> I don't see this being something I can take on in my spare time, so
> for now I suspect I'll have to do a full-tree duplicate-file-search on
> every ref update, and simply accept the 1-second update hook
> processing time/delay per pushed ref :(

I don't see why you need to do full-tree with existing options, nor
why the ls-tree option you want would somehow make it easier to avoid.
I think you can avoid the full-tree search with something like:

git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
\
   sort | uniq -i -d

The final "sort | uniq -i -d" is taken from Torsten's suggestion.

The git diff ... xargs git ls-tree section on the first line will
provide a list of all files (& subdirs) in the same directory as any
added file.  (Although, it has a blind spot for paths in the toplevel
directory.)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  7:41       ` Elijah Newren
@ 2022-10-14  8:03         ` Erik Cervin Edin
  2022-10-14  8:57           ` Tao Klerks
  2022-10-14  8:48         ` Tao Klerks
  1 sibling, 1 reply; 16+ messages in thread
From: Erik Cervin Edin @ 2022-10-14  8:03 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Tao Klerks, brian m. carlson,
	Ævar Arnfjörð Bjarmason, git

> I don't imagine I can make a perfectly correct and universal fix to
> this, but with case-insensitive matching on ls-tree in an update hook
> I believe I could reduce the frequency of this already-infrequent
> issue by at least 1000X, which would suit my purposes just fine. In my
> case filenames are mostly ansi-based, and I don't expect we've ever
> had Turkish filenames (turkish "i" being the most famous case-folding
> gotcha I think?).

How about doing it in something that's not ls-tree? Sounds like you
already have a script, it just takes a bit too long?

Something similar to

On Fri, Oct 14, 2022 at 6:59 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> For example, we can use Linux:
>  git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $?

In a repo with many files, maybe use git diff --name-only and just run
it periodically as a part of a check-in hook or something?

  git diff --name-only HEAD~100..HEAD | tr 'A-Z' 'a-z' | sort | uniq -d


On Fri, Oct 14, 2022 at 9:59 AM Elijah Newren <newren@gmail.com> wrote:
>
> git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> \
>    sort | uniq -i -d

Or what Elijah just wrote

> Coming at this from another angle, I guess we could teach git on
> case-insensitive filesystems to detect this situation (where two files
> in the index, with different contents, are pointing to the exact same
> filesystem file) and more explicitly warn the user of what's wrong,
> giving them clear help on how to fix it? And temporarily exclude those
> two files from its change reconciliation processes altogether to avoid
> ghost changes interfering with recovery actions like "pull"? Certainly
> that would be better than the current "ghost changes" behavior... but
> it would still be far less convenient than preventing (the vast
> majority of) these issues altogether, be that with a custom hook or a
> core option prohibiting clearly case-insensitive-duplicate files from
> being pushed.

That's not to say this isn't a good idea but for now I'd advice an
automated scripted route.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  4:51       ` Torsten Bögershausen
@ 2022-10-14  8:31         ` Tao Klerks
  2022-10-14  8:37           ` Erik Cervin Edin
  0 siblings, 1 reply; 16+ messages in thread
From: Tao Klerks @ 2022-10-14  8:31 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 6:51 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Thu, Oct 13, 2022 at 08:35:11AM +0200, Tao Klerks wrote:
>
> Did you ever consider to write a shell script,
> that can detect icase-collisions ?
>
> For example, we can use Linux:
>  git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $?
>  include/uapi/linux/netfilter_ipv4/ipt_ecn.h
>  include/uapi/linux/netfilter_ipv4/ipt_ttl.h
>  [snip the other files]
>
> The GNU versions of uniq allow an even shorter command,
> (But the POSIX versions don't)
>
> git ls-files  | sort | uniq -i -d
>
> I think that a script like this could do the trick:
>
> #!/bin/sh
> ret=1
> >/tmp/$$-exp
> git ls-files  | sort | uniq -i -d >/tmp/$$-act &&
>   cmp /tmp/$$-exp /tmp/$$-act &&
>     ret=0
>     rm -f /tmp/$$-exp /tmp/$$-act
>     exit $ret
>
>
> ####################
> The usage of files in /tmp is probably debatable,
> I want just illustrate how a combination of shell
> scripts in combination with existing commands can be used.
>
> The biggest step may be to introduce a server-side hook
> that does a check.
> But once that is done and working, you probably do
> not want to miss it.

Thanks for the proposal! Sorry I was a bit vague in my "I suspect I'll
have to do a full-tree duplicate-file-search on every ref update", but
your suggestion is almost exactly what I meant.

On my machine, on this repo, a full-tree case-insensitive duplicate
search costs me about 800ms for 100k files, or 1,800ms for 200k files:

git ls-tree --name-only -r $NEWHASH | sort | uniq -i -d

I need to use ls-tree rather than ls-files because this is indeed a
command to run in an update hook, and there is no working tree - no
(relevant) index, in a server-side update hook.

The 800ms for 100k files are composed of 200ms of ls-tree, 600ms of
sort, and about 10ms of uniq.

My intent with supporting icase pathspec magic was to do something like:

git --icase-pathspecs ls-tree --name-only -r $NEWHASH -- PATHS OF
ADDED FILES | sort | uniq -i -d

Which would be near-instantaneous in the vast majority of cases (and
I'd have some file count limit past which I would fall back to doing
the full tree, to avoid excessive command lengths). Unfortunately,
"--icase-pathspecs" is not supported in ls-tree, hence this thread :)

But yes - ultimately, paying that "full dupe search" per-update server
hook processing time cost has seemed like the only sensible way of
doing this - until I thought about Elijah's suggestion a little harder
that is!

More in the next part of the thread.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  8:31         ` Tao Klerks
@ 2022-10-14  8:37           ` Erik Cervin Edin
  0 siblings, 0 replies; 16+ messages in thread
From: Erik Cervin Edin @ 2022-10-14  8:37 UTC (permalink / raw)
  To: Tao Klerks
  Cc: Torsten Bögershausen, brian m. carlson,
	Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 10:32 AM Tao Klerks <tao@klerks.biz> wrote:
>
> I need to use ls-tree rather than ls-files because this is indeed a
> command to run in an update hook, and there is no working tree - no
> (relevant) index, in a server-side update hook.

I believe
  git diff --name-only
doesn't need a working tree

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  7:41       ` Elijah Newren
  2022-10-14  8:03         ` Erik Cervin Edin
@ 2022-10-14  8:48         ` Tao Klerks
  2022-10-14  9:07           ` Tao Klerks
  2022-10-14 17:06           ` Elijah Newren
  1 sibling, 2 replies; 16+ messages in thread
From: Tao Klerks @ 2022-10-14  8:48 UTC (permalink / raw)
  To: Elijah Newren
  Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
>
>
> How exactly would case-insensitive matching in ls-tree help you here?

I just attempted to demonstrate in response to Torsten's email; the
assumption was that I could list the added files' paths in a
case-insensitive pathspec, and thereby get all duplicates fast and
efficiently, for reasonably-sized commits.

(new refs and very large commits would still need a full-tree dupe scan)

> Can't you write a hook without such capability that rejects such
> collisions?

It is possible, but far less convenient and I'm not confident that my
shell scripting abilities will get me to a good place.

That said, having thought about your point, my shell scripting
abilities are more likely to get me to a good place than attempting to
add icase pathspec magic support to ls-tree :)

>
> > I don't see this being something I can take on in my spare time, so
> > for now I suspect I'll have to do a full-tree duplicate-file-search on
> > every ref update, and simply accept the 1-second update hook
> > processing time/delay per pushed ref :(
>
> I don't see why you need to do full-tree with existing options, nor
> why the ls-tree option you want would somehow make it easier to avoid.
> I think you can avoid the full-tree search with something like:
>
> git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> \
>    sort | uniq -i -d
>
> The final "sort | uniq -i -d" is taken from Torsten's suggestion.
>
> The git diff ... xargs git ls-tree section on the first line will
> provide a list of all files (& subdirs) in the same directory as any
> added file.  (Although, it has a blind spot for paths in the toplevel
> directory.)

The theoretical problem with this approach is that it only addresses
case-insensitive-duplicate files, not directories.

Directories have been the problem, in "my" repo, around one-third of
the time - typically someone does a directory rename, and someone else
does a bad merge and reintroduces the old directory.

That said, what "icase pathspec magic" actually *does*, is break down
the pathspec into iteratively more complete paths, level by level,
looking for case-duplicates at each level. That's something I could
presumably do in shell scripting, collecting all the interesting
sub-paths first, and then getting ls-tree to tell me about the
immediate children for each sub-path, doing case-insensitive dupe
searches across children for each of these sub-paths.

ls-tree supporting icase pathspec magic would clearly be more
efficient (I wouldn't need N ls-tree git processes, where N is the
number of sub-paths in the diff), but this should be plenty efficient
for normal commits, with a fallback to the full search

This seems like a sensible direction, I'll have a play.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  8:03         ` Erik Cervin Edin
@ 2022-10-14  8:57           ` Tao Klerks
  0 siblings, 0 replies; 16+ messages in thread
From: Tao Klerks @ 2022-10-14  8:57 UTC (permalink / raw)
  To: Erik Cervin Edin
  Cc: Elijah Newren, brian m. carlson,
	Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 10:04 AM Erik Cervin Edin <erik@cervined.in> wrote:
>
>
> On Fri, Oct 14, 2022 at 6:59 AM Torsten Bögershausen <tboegi@web.de> wrote:
> >
> > For example, we can use Linux:
> >  git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $?
>
> In a repo with many files, maybe use git diff --name-only and just run
> it periodically as a part of a check-in hook or something?
>
>   git diff --name-only HEAD~100..HEAD | tr 'A-Z' 'a-z' | sort | uniq -d
>
>
[... next email...]
> I believe
>  git diff --name-only
> doesn't need a working tree

I don't understand this suggestion; doesn't it only catch duplicates
where both instances were introduced in the same 100-commit range?

That has often or typically not been the case, in my experience. Often
one version of the file or folder will have existed for some time
(days, months, years), and then a "duplicate" will be introduced.

As far as I can tell this "diff large range" approach is quite
expensive (37 seconds in a trivial test on this repo), and
non-comprehensive.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  8:48         ` Tao Klerks
@ 2022-10-14  9:07           ` Tao Klerks
  2022-10-14 12:00             ` Erik Cervin Edin
  2022-10-14 17:06           ` Elijah Newren
  1 sibling, 1 reply; 16+ messages in thread
From: Tao Klerks @ 2022-10-14  9:07 UTC (permalink / raw)
  To: Elijah Newren
  Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

Small typo / confusion on my part:

On Fri, Oct 14, 2022 at 10:48 AM Tao Klerks <tao@klerks.biz> wrote:
>
> That said, what "icase pathspec magic" actually *does*, is break down
> the pathspec into iteratively more complete paths, level by level,
> looking for case-duplicates at each level.

I meant what it does is *collect path matches* at iteratively more
complete paths, level by level - it doesn't care about duplicates; any
duplicate-detection is what you can then do on the result.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  9:07           ` Tao Klerks
@ 2022-10-14 12:00             ` Erik Cervin Edin
  0 siblings, 0 replies; 16+ messages in thread
From: Erik Cervin Edin @ 2022-10-14 12:00 UTC (permalink / raw)
  To: Tao Klerks
  Cc: Elijah Newren, brian m. carlson,
	Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 10:58 AM Tao Klerks <tao@klerks.biz> wrote:
>
> I don't understand this suggestion; doesn't it only catch duplicates
> where both instances were introduced in the same 100-commit range?

Yes. It was a bit half-baked but the main idea was to limit the tree
to a smaller subset (and not the whole tree) and incrementally
checking for introduced duplicates instead of a full tree search. I
think that's basically Elijah's idea. Get all (added?) files
introduced in a certain revision range (last change, since yesterday
etc.) and then only check those against the tree for duplicates in a
manner of how you define duplicates

On Fri, Oct 14, 2022 at 10:50 AM Tao Klerks <tao@klerks.biz> wrote:
>
> Directories have been the problem, in "my" repo, around one-third of
> the time - typically someone does a directory rename, and someone else
> does a bad merge and reintroduces the old directory.

That adds a bit of complexity :/
but should still be doable.

Not perfect but maybe something along these lines? (caveat, possibly GNU only)

#!/bin/sh

# files added between revisions x y
added_files() {
    git diff --diff-filter=A --name-only --no-renames $1 $2 ;
}

# folders of those added files
added_folders() {
    added_files $1 $2 |
        sed -e '/[^\/]*/s@^@./@' -e 's@/[^/]*$@/@' |
         sort -u ;
}

# all files tracked by git in *those* folders at HEAD
possible_dupes() {
    added_folders $1 $2 |
        xargs git ls-tree --name-only HEAD ;
}

# case insensitive columns separated by \x1
# eg.
#path\x1PaTh
#path\x1path
case_insensitive() {
    sed -e 's@.*@\L\0\E\x1\0@' |
        sort ;
}


x=$1
y=$2
# Find all duplicates paths (case insensitive)
# in directories which were added between $x $y
possible_dupes $x $y |
    case_insensitive |
    awk -F '\x1' '
        # actual "duplicate" paths, column $2
        # as determined by case-insensitive column $1
        $1 in a { print a[$1]; print $2 }
        { a[$1]=$2 }
    '    | uniq

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14  8:48         ` Tao Klerks
  2022-10-14  9:07           ` Tao Klerks
@ 2022-10-14 17:06           ` Elijah Newren
  2022-10-15 22:06             ` Tao Klerks
  1 sibling, 1 reply; 16+ messages in thread
From: Elijah Newren @ 2022-10-14 17:06 UTC (permalink / raw)
  To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
>
> On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> >
[...]
> > I don't see why you need to do full-tree with existing options, nor
> > why the ls-tree option you want would somehow make it easier to avoid.
> > I think you can avoid the full-tree search with something like:
> >
> > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > \
> >    sort | uniq -i -d
> >
> > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> >
> > The git diff ... xargs git ls-tree section on the first line will
> > provide a list of all files (& subdirs) in the same directory as any
> > added file.  (Although, it has a blind spot for paths in the toplevel
> > directory.)
>
> The theoretical problem with this approach is that it only addresses
> case-insensitive-duplicate files, not directories.

It'll catch some case-insensitive-duplicate directories too -- note
that I did call out that it'd print subdirs.  But to be more cautious,
you would need to carefully grab all leading directories of any added
path, not just the immediate leading directory.

> Directories have been the problem, in "my" repo, around one-third of
> the time - typically someone does a directory rename, and someone else
> does a bad merge and reintroduces the old directory.
>
> That said, what "icase pathspec magic" actually *does*, is break down
> the pathspec into iteratively more complete paths, level by level,
> looking for case-duplicates at each level. That's something I could
> presumably do in shell scripting, collecting all the interesting
> sub-paths first, and then getting ls-tree to tell me about the
> immediate children for each sub-path, doing case-insensitive dupe
> searches across children for each of these sub-paths.
>
> ls-tree supporting icase pathspec magic would clearly be more
> efficient (I wouldn't need N ls-tree git processes, where N is the
> number of sub-paths in the diff), but this should be plenty efficient
> for normal commits, with a fallback to the full search
>
> This seems like a sensible direction, I'll have a play.

If you create a script that gives you all leading directories of any
listed path (plus replacing the toplevel dir with ':/'), such as this
(which I'm calling 'all-leading-dirs.py'):

"""
#!/usr/bin/env python3

import os
import sys

paths = sys.stdin.read().splitlines()
dirs_seen = set()
for path in paths:
  dir = path
  while dir:
    dir = os.path.dirname(dir)
    if dir in dirs_seen:
      continue
    dirs_seen.add(dir)
if dirs_seen:
  # Replace top-level dir of "" with ":"; we'll add the trailing '/'
below when adding it to all other dirs
  dirs_seen.remove("")
  dirs_seen.add(':')
  for dir in dirs_seen:
    print(dir+'/')  # ls-tree wants the trailing '/' if we are going
to list contents within that tree rather than just the tree itself
"""

Then the following will catch duplicates files and directories for you:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq -i -d

and it no longer has problems catching duplicates in the toplevel
directory either.  It does have (at most) two git invocations, but
only one invocation of ls-tree.  Here's a test script to prove it
works:

"""
#!/bin/bash

git init -b main nukeme
cd nukeme
mkdir -p dir1/subdir/whatever
mkdir -p dir2/subdir/whatever
>dir1/subdir/whatever/foo
>dir2/subdir/whatever/foo
git add .
git commit -m initial

mkdir -p dir1/SubDir/whatever
>dir1/SubDir/whatever/foo
git add .
git commit -m stuff

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq -i -d
"""

The output of this script is
"""
dir1/subdir
"""
which correctly notifies on the duplicate (dir1/SubDir being the
other; uniq is the one that picks which of the two duplicate names to
print)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-14 17:06           ` Elijah Newren
@ 2022-10-15 22:06             ` Tao Klerks
  2022-10-17 15:46               ` Tao Klerks
  0 siblings, 1 reply; 16+ messages in thread
From: Tao Klerks @ 2022-10-15 22:06 UTC (permalink / raw)
  To: Elijah Newren
  Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

This seems to be working, thank you!!!

Two updates I had to make, in case this is useful to anyone else:

1: I'm getting some weird behavior I can't explain yet, where some
paths are returned from the ls-tree call twice: The input to ls-tree
is all unique paths, but the output somehow includes a relatively
small subset of paths twice.
This mysterious issue is easily addressed by adding an extra "uniq"
call to remove the "trivial dupes" before hunting for the
"case-insensitive dupes" we're interested in:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq | uniq -i -d

2: The xargs call has issues with paths with spaces in them. Adding
-d"\n" seems to be a clean way to fix this

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq | uniq -i -d


Not only does this approach seem to work well, but it also has far
better performance characteristics than I was expecting!

Simple small commit (10 files): 20ms
Reasonably large commit (10,000 files): 250ms
Diff from empty on a smaller branch (100,000 files): 2,800ms
Diff from empty on a larger branch (200,000 files): 5,400ms


It still makes sense to check the number of files/lines after doing
the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check
instead of this when your diff size goes past some file count
threshold, but it looks like that threshold would be quite high in my
environment - 30k files maybe?

I will have a go at writing a full update hook, and (without knowing
whether it will make sense from a performance perspective) I'd like to
try to implement the "all-leading-dirs" logic in bash 4 using
associative arrays, to remove the python dependency. If I make it work
I'll post back here.

This seems to cover what I needed icase pathspec magic for in ls-tree,
without having to implement it - so thanks again!

Tao

On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
> >
> > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> [...]
> > > I don't see why you need to do full-tree with existing options, nor
> > > why the ls-tree option you want would somehow make it easier to avoid.
> > > I think you can avoid the full-tree search with something like:
> > >
> > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > > \
> > >    sort | uniq -i -d
> > >
> > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > >
> > > The git diff ... xargs git ls-tree section on the first line will
> > > provide a list of all files (& subdirs) in the same directory as any
> > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > directory.)
> >
> > The theoretical problem with this approach is that it only addresses
> > case-insensitive-duplicate files, not directories.
>
> It'll catch some case-insensitive-duplicate directories too -- note
> that I did call out that it'd print subdirs.  But to be more cautious,
> you would need to carefully grab all leading directories of any added
> path, not just the immediate leading directory.
>
> > Directories have been the problem, in "my" repo, around one-third of
> > the time - typically someone does a directory rename, and someone else
> > does a bad merge and reintroduces the old directory.
> >
> > That said, what "icase pathspec magic" actually *does*, is break down
> > the pathspec into iteratively more complete paths, level by level,
> > looking for case-duplicates at each level. That's something I could
> > presumably do in shell scripting, collecting all the interesting
> > sub-paths first, and then getting ls-tree to tell me about the
> > immediate children for each sub-path, doing case-insensitive dupe
> > searches across children for each of these sub-paths.
> >
> > ls-tree supporting icase pathspec magic would clearly be more
> > efficient (I wouldn't need N ls-tree git processes, where N is the
> > number of sub-paths in the diff), but this should be plenty efficient
> > for normal commits, with a fallback to the full search
> >
> > This seems like a sensible direction, I'll have a play.
>
> If you create a script that gives you all leading directories of any
> listed path (plus replacing the toplevel dir with ':/'), such as this
> (which I'm calling 'all-leading-dirs.py'):
>
> """
> #!/usr/bin/env python3
>
> import os
> import sys
>
> paths = sys.stdin.read().splitlines()
> dirs_seen = set()
> for path in paths:
>   dir = path
>   while dir:
>     dir = os.path.dirname(dir)
>     if dir in dirs_seen:
>       continue
>     dirs_seen.add(dir)
> if dirs_seen:
>   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
> below when adding it to all other dirs
>   dirs_seen.remove("")
>   dirs_seen.add(':')
>   for dir in dirs_seen:
>     print(dir+'/')  # ls-tree wants the trailing '/' if we are going
> to list contents within that tree rather than just the tree itself
> """
>
> Then the following will catch duplicates files and directories for you:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq -i -d
>
> and it no longer has problems catching duplicates in the toplevel
> directory either.  It does have (at most) two git invocations, but
> only one invocation of ls-tree.  Here's a test script to prove it
> works:
>
> """
> #!/bin/bash
>
> git init -b main nukeme
> cd nukeme
> mkdir -p dir1/subdir/whatever
> mkdir -p dir2/subdir/whatever
> >dir1/subdir/whatever/foo
> >dir2/subdir/whatever/foo
> git add .
> git commit -m initial
>
> mkdir -p dir1/SubDir/whatever
> >dir1/SubDir/whatever/foo
> git add .
> git commit -m stuff
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq -i -d
> """
>
> The output of this script is
> """
> dir1/subdir
> """
> which correctly notifies on the duplicate (dir1/SubDir being the
> other; uniq is the one that picks which of the two duplicate names to
> print)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: icase pathspec magic support in ls-tree
  2022-10-15 22:06             ` Tao Klerks
@ 2022-10-17 15:46               ` Tao Klerks
  0 siblings, 0 replies; 16+ messages in thread
From: Tao Klerks @ 2022-10-17 15:46 UTC (permalink / raw)
  To: Elijah Newren
  Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git

[-- Attachment #1: Type: text/plain, Size: 7541 bytes --]

Thanks again for the code samples above!

I've spent some time productizing this for my environment, and I think
it's pretty close to optimal for my environment at least.

In case this can be of value to anyone else, I've attached the final thing.

This version is genericized a little - two specific specializations
for my environment not included here are:
* a specific strategy for determining the most-likely-correct upstream
branch to use as a default for "diff from" for new branches, to avoid
having to do a full-tree dupe search; in most environments choosing
"master" or "main" would likely be the right thing most of the time
* a specific list of branch paths that we should validate (vs others
that we should not)

Best regards,
Tao

On Sun, Oct 16, 2022 at 12:06 AM Tao Klerks <tao@klerks.biz> wrote:
>
> This seems to be working, thank you!!!
>
> Two updates I had to make, in case this is useful to anyone else:
>
> 1: I'm getting some weird behavior I can't explain yet, where some
> paths are returned from the ls-tree call twice: The input to ls-tree
> is all unique paths, but the output somehow includes a relatively
> small subset of paths twice.
> This mysterious issue is easily addressed by adding an extra "uniq"
> call to remove the "trivial dupes" before hunting for the
> "case-insensitive dupes" we're interested in:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
> 2: The xargs call has issues with paths with spaces in them. Adding
> -d"\n" seems to be a clean way to fix this
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq | uniq -i -d
>
>
> Not only does this approach seem to work well, but it also has far
> better performance characteristics than I was expecting!
>
> Simple small commit (10 files): 20ms
> Reasonably large commit (10,000 files): 250ms
> Diff from empty on a smaller branch (100,000 files): 2,800ms
> Diff from empty on a larger branch (200,000 files): 5,400ms
>
>
> It still makes sense to check the number of files/lines after doing
> the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check
> instead of this when your diff size goes past some file count
> threshold, but it looks like that threshold would be quite high in my
> environment - 30k files maybe?
>
> I will have a go at writing a full update hook, and (without knowing
> whether it will make sense from a performance perspective) I'd like to
> try to implement the "all-leading-dirs" logic in bash 4 using
> associative arrays, to remove the python dependency. If I make it work
> I'll post back here.
>
> This seems to cover what I needed icase pathspec magic for in ls-tree,
> without having to implement it - so thanks again!
>
> Tao
>
> On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
> > >
> > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> > > >
> > [...]
> > > > I don't see why you need to do full-tree with existing options, nor
> > > > why the ls-tree option you want would somehow make it easier to avoid.
> > > > I think you can avoid the full-tree search with something like:
> > > >
> > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > > > \
> > > >    sort | uniq -i -d
> > > >
> > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > > >
> > > > The git diff ... xargs git ls-tree section on the first line will
> > > > provide a list of all files (& subdirs) in the same directory as any
> > > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > > directory.)
> > >
> > > The theoretical problem with this approach is that it only addresses
> > > case-insensitive-duplicate files, not directories.
> >
> > It'll catch some case-insensitive-duplicate directories too -- note
> > that I did call out that it'd print subdirs.  But to be more cautious,
> > you would need to carefully grab all leading directories of any added
> > path, not just the immediate leading directory.
> >
> > > Directories have been the problem, in "my" repo, around one-third of
> > > the time - typically someone does a directory rename, and someone else
> > > does a bad merge and reintroduces the old directory.
> > >
> > > That said, what "icase pathspec magic" actually *does*, is break down
> > > the pathspec into iteratively more complete paths, level by level,
> > > looking for case-duplicates at each level. That's something I could
> > > presumably do in shell scripting, collecting all the interesting
> > > sub-paths first, and then getting ls-tree to tell me about the
> > > immediate children for each sub-path, doing case-insensitive dupe
> > > searches across children for each of these sub-paths.
> > >
> > > ls-tree supporting icase pathspec magic would clearly be more
> > > efficient (I wouldn't need N ls-tree git processes, where N is the
> > > number of sub-paths in the diff), but this should be plenty efficient
> > > for normal commits, with a fallback to the full search
> > >
> > > This seems like a sensible direction, I'll have a play.
> >
> > If you create a script that gives you all leading directories of any
> > listed path (plus replacing the toplevel dir with ':/'), such as this
> > (which I'm calling 'all-leading-dirs.py'):
> >
> > """
> > #!/usr/bin/env python3
> >
> > import os
> > import sys
> >
> > paths = sys.stdin.read().splitlines()
> > dirs_seen = set()
> > for path in paths:
> >   dir = path
> >   while dir:
> >     dir = os.path.dirname(dir)
> >     if dir in dirs_seen:
> >       continue
> >     dirs_seen.add(dir)
> > if dirs_seen:
> >   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
> > below when adding it to all other dirs
> >   dirs_seen.remove("")
> >   dirs_seen.add(':')
> >   for dir in dirs_seen:
> >     print(dir+'/')  # ls-tree wants the trailing '/' if we are going
> > to list contents within that tree rather than just the tree itself
> > """
> >
> > Then the following will catch duplicates files and directories for you:
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> >
> > and it no longer has problems catching duplicates in the toplevel
> > directory either.  It does have (at most) two git invocations, but
> > only one invocation of ls-tree.  Here's a test script to prove it
> > works:
> >
> > """
> > #!/bin/bash
> >
> > git init -b main nukeme
> > cd nukeme
> > mkdir -p dir1/subdir/whatever
> > mkdir -p dir2/subdir/whatever
> > >dir1/subdir/whatever/foo
> > >dir2/subdir/whatever/foo
> > git add .
> > git commit -m initial
> >
> > mkdir -p dir1/SubDir/whatever
> > >dir1/SubDir/whatever/foo
> > git add .
> > git commit -m stuff
> >
> > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> > -t HEAD | sort | uniq -i -d
> > """
> >
> > The output of this script is
> > """
> > dir1/subdir
> > """
> > which correctly notifies on the duplicate (dir1/SubDir being the
> > other; uniq is the one that picks which of the two duplicate names to
> > print)

[-- Attachment #2: case-insensitive-duplicate-check-update-hook.bash --]
[-- Type: application/octet-stream, Size: 8047 bytes --]

#!/bin/bash

UPDATING_REF="$1"
FROM_HASH="$2"
TO_HASH="$3"

# GENERAL Problem Background:
# - In git, repo paths are inherently case-sensitive
# - In two of the three major dev workstation OSs (Windows, OSX), filesystems are typically case-insensitive
# - When files are added in the repo that are duplicates in the filesystem, but not in git, usability problems abound
# - A particularly onerous issue occurs when case-insensitive-duplicate files have differing content; after a checkout
#       of such a tree, the case-insensitive working tree always looks like it has a change to one or the other of the
#       VCS files. In such an "unavoidably/unfixably dirty tree" state, a casual user cannot even "pull" a fixed branch
#       state, as git will will refuse to pull in a dirty working tree.
# - Git currently offers no built-in mechanism to prevent this
# - It is preferable to avoid this issue in a server hook, rather than in CI validations (although the latter can also
#       work) - the earlier the issue is "nipped in the bud" the better.
# - There is a reasonably trivial shell scripting test for "Tree has duplicates", but validating simply on the basis of
#       this test presents two challenges:
#   - It might be too expensive to run on every commit in a large repo
#   - It might trigger "false positives" when creating new refs pointing to old already-problematic commits

# SPECIFIC Problem Repo Background:
# - 200K files, in deeply nested (java) folder structures (+50k directories)
# - Doing a "naive" full-tree case-insensitive duplicate search takes around 2s
#   - This is not a price we're willing to pay on every branch update
# - Case-insensitive duplicates are sometimes folders, sometimes files
# - Repo has lots of refs - 110k - so scanning them is costly
# - New refs are created *very* often - most often as "keeparound" refs in GitLab
#   - Validations don't make sense / don't add value for some or many classes of ref (those created by automated
#       systems, where a violation is not the "fault" of the pusher)
# - Some (active) branches are very distant from each other - there is no single "master"
 
# Test Cases / situations:
# - Single "normal" change, a dozen added files
# - Single "large-ish" change, a few hundred added files
# - on an existing branch (eg simple change, merge from upstream)
# - on a new branch, varying the original/upstream branch
# - on new refs that are excluded from validation by path/pattern

# Expected & tested performance characteristics of this hook script:
# - For non-branch or otherwise disqualified refs, close to no overhead.
# - For updates to existing refs, it is reasonably optimal
#   - There are four non-trivial operations, which all scale with the diff size:
#     - Finding the files added in the ref update (git diff)
#     - Finding all the ancestor-paths for these files (output_all_ancestor_gitpaths)
#       - This *can* be made slightly more efficient by using eg python
#     - Finding all items in these paths (git ls-tree)
#     - Sorting the resulting paths/items (sort)
#   - When the number of added files is too high, we fall back to a full-tree search
# - For new non-disqualified refs, there are additional overheads:
#   - Finding the best/closest "standard branch" to diff from, for a new ref, might be non-trivial in some repos
#     - (in the generic implementation, it is a trivial/cheap check for "master" and "main")
#   - Depending on the strategy used to find a "diff from" ref, diffs on new refs will likely be much larger/more expensive than they should ideally need to be

# Future work / possible improvements:
# - If git provided an easy/cheap way to detect "commits new to this repo" in an update hook, the diffs could be consistently much smaller/more efficient.
#   - (it is possible to find out whether a commit is net-new with something like "git rev-list MYREF --not --exclude MYREF --all", but this is *expensive* in a large repo)
# - If "git ls-tree" supported icase pathspec magic, we could pass in full added-file pathspecs, and subsequent dupe-detection could be more efficient
#   - (there would be a functional difference: the current strategy finds dupe directories even if they have non-overlapping contents; this other approach would not)

# Requirements:
# - Bash 4 for associative arrays


#---------
#FUNCTIONS - called via subshell
#---------

output_all_ancestor_gitpaths() {
  declare -A INTERESTING_PATHS
  while IFS= read -r INPUT_LINE; do
    CURRENT_PATH="$INPUT_LINE"
    while [ "$CURRENT_PATH" != ":" ]; do
      [[ "$CURRENT_PATH" == */* ]] || CURRENT_PATH=":/$CURRENT_PATH"
      PARENT_PATH="${CURRENT_PATH%/*}"
      if [ "${INTERESTING_PATHS["${PARENT_PATH}/"]}" = "1" ]; then
        break
      fi
      INTERESTING_PATHS["${PARENT_PATH}/"]="1"
      CURRENT_PATH="$PARENT_PATH"
    done
  done <<< "$1"
  for key in ${!INTERESTING_PATHS[@]}; do echo "$key"; done
  debug_log "FOUND ${#INTERESTING_PATHS[@]} INTERESTING PATHS" 
}

find_closest_standard_branch() {
  # Simplest implementation, look for "master" or "main" or give up.
  # there will likely be better repo-specific strategies in specific repos.
  git for-each-ref --format="%(refname)" --count=1 refs/heads/master refs/heads/main
}

START_TIME="$(date +%s%N)"
debug_log() {
  if [ -n "$DEBUG_DUPE_HOOK" ]; then
    CURRENT_TIME="$(date +%s%N)"
    echo "Debug at $(($(($CURRENT_TIME-$START_TIME))/1000000))ms: $1" >&2
  fi
}

#-----------
#MAIN SCRIPT
#-----------

if [ -z "$TO_HASH" ]; then
  # branch deletion, no possibility of new dupes
  debug_log "BRANCH DELETION - SKIPPING DUPE VALIDATION"
  exit 0
fi

VALIDATING_PATH_PREFIXES=("refs/heads/")
REF_IS_VALIDATABLE=0
for PATH_PREFIX in "${VALIDATING_PATH_PREFIXES[@]}"
do
  if [[ "$UPDATING_REF" = "$PATH_PREFIX"* ]]; then
    REF_IS_VALIDATABLE=1
  fi
done

if [[ "$REF_IS_VALIDATABLE" = "0" ]]; then
  # we're not interested in validating for dupes on this ref.
  debug_log "NON-VALIDATABLE REF - SKIPPING DUPE VALIDATION"
  exit 0
fi

FULL_SEARCH=false
if [ -z "$FROM_HASH" ]; then
  # New ref; in most repos new "interesting" (not-excluded-from-dupe-check)
  # refs will likely be closely related to master/main most of the time, so
  # diff from there rather than starting from scratch.
  debug_log "NEW REF - LOOKING FOR STANDARD BRANCH"
  FROM_HASH=$(find_closest_standard_branch "$TO_HASH")
fi

if [ -z "$FROM_HASH" ]; then
  # If this is a new ref and no "standard upstream ref" was found, fall back
  # to full dupe search.
  debug_log "NO STANDARD UPSTREAM REF FOUND; DOING FULL SEARCH"
  FULL_SEARCH=true
else
  debug_log "CHECKING DIFF: git diff --diff-filter=A --no-renames --name-only '$FROM_HASH' '$TO_HASH'"
  NEW_FILES=$(git diff --diff-filter=A --no-renames --name-only "$FROM_HASH" "$TO_HASH")
  NEW_FILE_LINE_COUNT=$(wc -l <<< "$NEW_FILES")
  
  if [ "$NEW_FILE_LINE_COUNT" -le "1" ] && [ -z "$NEW_FILES" ]; then
    # no newly added files, no possibility of new dupes
	# (count "1" can be an empty line)
    debug_log "EXITING WITH 0 ADDED FILES"
    exit 0
  fi

  if [ "$NEW_FILE_LINE_COUNT" -gt 10000 ]; then
    # If there are too many files, then generating all the sub-paths and
    # running ls-tree with a huge pathspec will be slower than simply
    # checking the whole tree.
    debug_log "FILE COUNT $NEW_FILE_LINE_COUNT OVER LIMIT; DOING FULL SEARCH"
    FULL_SEARCH=true
  fi
fi

if [ "$FULL_SEARCH" = "true" ]; then
  debug_log "DOING FULL DUPE SEARCH"
  DUPE_PATHS=$(git ls-tree --name-only -r "$TO_HASH" | sort | uniq -i -D)
else
  debug_log "DOING DIFF SEARCH OF $NEW_FILE_LINE_COUNT ADDED FILES BETWEEN $FROM_HASH AND $TO_HASH"
  DUPE_PATHS=$(output_all_ancestor_gitpaths "$NEW_FILES" | sort | xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t "$TO_HASH" | sort | uniq | uniq -i -D)
fi

if [ -z "$DUPE_PATHS" ]; then
  debug_log "EXITING - NO DUPES FOUND"
  exit 0
else
  debug_log "EXITING WITH DUPES ERROR"
  echo "This ref cannot be accepted - duplicate folders or files found:"
  echo "$DUPE_PATHS"
  exit 1
fi

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-10-17 15:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks
2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
2022-10-02 19:07   ` brian m. carlson
2022-10-13  6:35     ` Tao Klerks
2022-10-14  4:51       ` Torsten Bögershausen
2022-10-14  8:31         ` Tao Klerks
2022-10-14  8:37           ` Erik Cervin Edin
2022-10-14  7:41       ` Elijah Newren
2022-10-14  8:03         ` Erik Cervin Edin
2022-10-14  8:57           ` Tao Klerks
2022-10-14  8:48         ` Tao Klerks
2022-10-14  9:07           ` Tao Klerks
2022-10-14 12:00             ` Erik Cervin Edin
2022-10-14 17:06           ` Elijah Newren
2022-10-15 22:06             ` Tao Klerks
2022-10-17 15:46               ` Tao Klerks

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).