git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Tao Klerks <tao@klerks.biz>
To: Elijah Newren <newren@gmail.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: icase pathspec magic support in ls-tree
Date: Sun, 16 Oct 2022 00:06:50 +0200	[thread overview]
Message-ID: <CAPMMpoictABXUhVCASZOiYZ4nydGtqiDYRpAELEBvbPn_5jRWA@mail.gmail.com> (raw)
In-Reply-To: <CABPp-BEFoco4wg4BVE_aPz95cfMwAmbwU19eFyUOJTKj3q6wJA@mail.gmail.com>

This seems to be working, thank you!!!

Two updates I had to make, in case this is useful to anyone else:

1: I'm getting some weird behavior I can't explain yet, where some
paths are returned from the ls-tree call twice: The input to ls-tree
is all unique paths, but the output somehow includes a relatively
small subset of paths twice.
This mysterious issue is easily addressed by adding an extra "uniq"
call to remove the "trivial dupes" before hunting for the
"case-insensitive dupes" we're interested in:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq | uniq -i -d

2: The xargs call has issues with paths with spaces in them. Adding
-d"\n" seems to be a clean way to fix this

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq | uniq -i -d


Not only does this approach seem to work well, but it also has far
better performance characteristics than I was expecting!

Simple small commit (10 files): 20ms
Reasonably large commit (10,000 files): 250ms
Diff from empty on a smaller branch (100,000 files): 2,800ms
Diff from empty on a larger branch (200,000 files): 5,400ms


It still makes sense to check the number of files/lines after doing
the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check
instead of this when your diff size goes past some file count
threshold, but it looks like that threshold would be quite high in my
environment - 30k files maybe?

I will have a go at writing a full update hook, and (without knowing
whether it will make sense from a performance perspective) I'd like to
try to implement the "all-leading-dirs" logic in bash 4 using
associative arrays, to remove the python dependency. If I make it work
I'll post back here.

This seems to cover what I needed icase pathspec magic for in ls-tree,
without having to implement it - so thanks again!

Tao

On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
> >
> > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> [...]
> > > I don't see why you need to do full-tree with existing options, nor
> > > why the ls-tree option you want would somehow make it easier to avoid.
> > > I think you can avoid the full-tree search with something like:
> > >
> > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > > \
> > >    sort | uniq -i -d
> > >
> > > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> > >
> > > The git diff ... xargs git ls-tree section on the first line will
> > > provide a list of all files (& subdirs) in the same directory as any
> > > added file.  (Although, it has a blind spot for paths in the toplevel
> > > directory.)
> >
> > The theoretical problem with this approach is that it only addresses
> > case-insensitive-duplicate files, not directories.
>
> It'll catch some case-insensitive-duplicate directories too -- note
> that I did call out that it'd print subdirs.  But to be more cautious,
> you would need to carefully grab all leading directories of any added
> path, not just the immediate leading directory.
>
> > Directories have been the problem, in "my" repo, around one-third of
> > the time - typically someone does a directory rename, and someone else
> > does a bad merge and reintroduces the old directory.
> >
> > That said, what "icase pathspec magic" actually *does*, is break down
> > the pathspec into iteratively more complete paths, level by level,
> > looking for case-duplicates at each level. That's something I could
> > presumably do in shell scripting, collecting all the interesting
> > sub-paths first, and then getting ls-tree to tell me about the
> > immediate children for each sub-path, doing case-insensitive dupe
> > searches across children for each of these sub-paths.
> >
> > ls-tree supporting icase pathspec magic would clearly be more
> > efficient (I wouldn't need N ls-tree git processes, where N is the
> > number of sub-paths in the diff), but this should be plenty efficient
> > for normal commits, with a fallback to the full search
> >
> > This seems like a sensible direction, I'll have a play.
>
> If you create a script that gives you all leading directories of any
> listed path (plus replacing the toplevel dir with ':/'), such as this
> (which I'm calling 'all-leading-dirs.py'):
>
> """
> #!/usr/bin/env python3
>
> import os
> import sys
>
> paths = sys.stdin.read().splitlines()
> dirs_seen = set()
> for path in paths:
>   dir = path
>   while dir:
>     dir = os.path.dirname(dir)
>     if dir in dirs_seen:
>       continue
>     dirs_seen.add(dir)
> if dirs_seen:
>   # Replace top-level dir of "" with ":"; we'll add the trailing '/'
> below when adding it to all other dirs
>   dirs_seen.remove("")
>   dirs_seen.add(':')
>   for dir in dirs_seen:
>     print(dir+'/')  # ls-tree wants the trailing '/' if we are going
> to list contents within that tree rather than just the tree itself
> """
>
> Then the following will catch duplicates files and directories for you:
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq -i -d
>
> and it no longer has problems catching duplicates in the toplevel
> directory either.  It does have (at most) two git invocations, but
> only one invocation of ls-tree.  Here's a test script to prove it
> works:
>
> """
> #!/bin/bash
>
> git init -b main nukeme
> cd nukeme
> mkdir -p dir1/subdir/whatever
> mkdir -p dir2/subdir/whatever
> >dir1/subdir/whatever/foo
> >dir2/subdir/whatever/foo
> git add .
> git commit -m initial
>
> mkdir -p dir1/SubDir/whatever
> >dir1/SubDir/whatever/foo
> git add .
> git commit -m stuff
>
> git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
> all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
> -t HEAD | sort | uniq -i -d
> """
>
> The output of this script is
> """
> dir1/subdir
> """
> which correctly notifies on the duplicate (dir1/SubDir being the
> other; uniq is the one that picks which of the two duplicate names to
> print)

  reply	other threads:[~2022-10-15 22:07 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks
2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
2022-10-02 19:07   ` brian m. carlson
2022-10-13  6:35     ` Tao Klerks
2022-10-14  4:51       ` Torsten Bögershausen
2022-10-14  8:31         ` Tao Klerks
2022-10-14  8:37           ` Erik Cervin Edin
2022-10-14  7:41       ` Elijah Newren
2022-10-14  8:03         ` Erik Cervin Edin
2022-10-14  8:57           ` Tao Klerks
2022-10-14  8:48         ` Tao Klerks
2022-10-14  9:07           ` Tao Klerks
2022-10-14 12:00             ` Erik Cervin Edin
2022-10-14 17:06           ` Elijah Newren
2022-10-15 22:06             ` Tao Klerks [this message]
2022-10-17 15:46               ` Tao Klerks

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPMMpoictABXUhVCASZOiYZ4nydGtqiDYRpAELEBvbPn_5jRWA@mail.gmail.com \
    --to=tao@klerks.biz \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).