git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Tao Klerks <tao@klerks.biz>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: icase pathspec magic support in ls-tree
Date: Fri, 14 Oct 2022 10:06:23 -0700	[thread overview]
Message-ID: <CABPp-BEFoco4wg4BVE_aPz95cfMwAmbwU19eFyUOJTKj3q6wJA@mail.gmail.com> (raw)
In-Reply-To: <CAPMMpojvDj7Yc27HKQU4seSqg5Tx61RY3LOgMfkK=a0J25QYQw@mail.gmail.com>

On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote:
>
> On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote:
> >
[...]
> > I don't see why you need to do full-tree with existing options, nor
> > why the ls-tree option you want would somehow make it easier to avoid.
> > I think you can avoid the full-tree search with something like:
> >
> > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH |
> > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH |
> > \
> >    sort | uniq -i -d
> >
> > The final "sort | uniq -i -d" is taken from Torsten's suggestion.
> >
> > The git diff ... xargs git ls-tree section on the first line will
> > provide a list of all files (& subdirs) in the same directory as any
> > added file.  (Although, it has a blind spot for paths in the toplevel
> > directory.)
>
> The theoretical problem with this approach is that it only addresses
> case-insensitive-duplicate files, not directories.

It'll catch some case-insensitive-duplicate directories too -- note
that I did call out that it'd print subdirs.  But to be more cautious,
you would need to carefully grab all leading directories of any added
path, not just the immediate leading directory.

> Directories have been the problem, in "my" repo, around one-third of
> the time - typically someone does a directory rename, and someone else
> does a bad merge and reintroduces the old directory.
>
> That said, what "icase pathspec magic" actually *does*, is break down
> the pathspec into iteratively more complete paths, level by level,
> looking for case-duplicates at each level. That's something I could
> presumably do in shell scripting, collecting all the interesting
> sub-paths first, and then getting ls-tree to tell me about the
> immediate children for each sub-path, doing case-insensitive dupe
> searches across children for each of these sub-paths.
>
> ls-tree supporting icase pathspec magic would clearly be more
> efficient (I wouldn't need N ls-tree git processes, where N is the
> number of sub-paths in the diff), but this should be plenty efficient
> for normal commits, with a fallback to the full search
>
> This seems like a sensible direction, I'll have a play.

If you create a script that gives you all leading directories of any
listed path (plus replacing the toplevel dir with ':/'), such as this
(which I'm calling 'all-leading-dirs.py'):

"""
#!/usr/bin/env python3

import os
import sys

paths = sys.stdin.read().splitlines()
dirs_seen = set()
for path in paths:
  dir = path
  while dir:
    dir = os.path.dirname(dir)
    if dir in dirs_seen:
      continue
    dirs_seen.add(dir)
if dirs_seen:
  # Replace top-level dir of "" with ":"; we'll add the trailing '/'
below when adding it to all other dirs
  dirs_seen.remove("")
  dirs_seen.add(':')
  for dir in dirs_seen:
    print(dir+'/')  # ls-tree wants the trailing '/' if we are going
to list contents within that tree rather than just the tree itself
"""

Then the following will catch duplicates files and directories for you:

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq -i -d

and it no longer has problems catching duplicates in the toplevel
directory either.  It does have (at most) two git invocations, but
only one invocation of ls-tree.  Here's a test script to prove it
works:

"""
#!/bin/bash

git init -b main nukeme
cd nukeme
mkdir -p dir1/subdir/whatever
mkdir -p dir2/subdir/whatever
>dir1/subdir/whatever/foo
>dir2/subdir/whatever/foo
git add .
git commit -m initial

mkdir -p dir1/SubDir/whatever
>dir1/SubDir/whatever/foo
git add .
git commit -m stuff

git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD |
all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only
-t HEAD | sort | uniq -i -d
"""

The output of this script is
"""
dir1/subdir
"""
which correctly notifies on the duplicate (dir1/SubDir being the
other; uniq is the one that picks which of the two duplicate names to
print)

  parent reply	other threads:[~2022-10-14 17:06 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks
2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason
2022-10-02 19:07   ` brian m. carlson
2022-10-13  6:35     ` Tao Klerks
2022-10-14  4:51       ` Torsten Bögershausen
2022-10-14  8:31         ` Tao Klerks
2022-10-14  8:37           ` Erik Cervin Edin
2022-10-14  7:41       ` Elijah Newren
2022-10-14  8:03         ` Erik Cervin Edin
2022-10-14  8:57           ` Tao Klerks
2022-10-14  8:48         ` Tao Klerks
2022-10-14  9:07           ` Tao Klerks
2022-10-14 12:00             ` Erik Cervin Edin
2022-10-14 17:06           ` Elijah Newren [this message]
2022-10-15 22:06             ` Tao Klerks
2022-10-17 15:46               ` Tao Klerks

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BEFoco4wg4BVE_aPz95cfMwAmbwU19eFyUOJTKj3q6wJA@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    --cc=tao@klerks.biz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).