git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Efficiently detecting paths that differ from each other only in case
@ 2010-10-08  6:13 Dun Peal
  2010-10-08 13:50 ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Dun Peal @ 2010-10-08  6:13 UTC (permalink / raw)
  To: git

Hi,

I'm writing a Git hook, for a bare central repo, to reject pushes of
paths that differ from existing tracked paths only by letter case.

So, for example, if the following path is tracked in the repo:

foo/bar.py

Commits adding any of the following paths would be rejected:

Foo/bar.py
foo/Bar.py
fOO/BAR.py

Etc. I know how to do it by listing paths with ls-files, but my repo
contains many thousands of files, so I was wondering if there was a
more efficient way than for every commit:

1. Get a list of all paths in the repo from ls-files.
2. Lowercase all paths.
3. Check for repeats.

Thanks, D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08  6:13 Efficiently detecting paths that differ from each other only in case Dun Peal
@ 2010-10-08 13:50 ` Jeff King
  2010-10-08 19:44   ` Dun Peal
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2010-10-08 13:50 UTC (permalink / raw)
  To: Dun Peal; +Cc: git

On Fri, Oct 08, 2010 at 01:13:07AM -0500, Dun Peal wrote:

> Etc. I know how to do it by listing paths with ls-files, but my repo
> contains many thousands of files, so I was wondering if there was a
> more efficient way than for every commit:
> 
> 1. Get a list of all paths in the repo from ls-files.
> 2. Lowercase all paths.
> 3. Check for repeats.

You can do:

  git ls-files | tr A-Z a-z | sort | uniq -d

but:

  1. It will print only the lowercased version, not all versions.

  2. It doesn't handle filenames with embedded newlines.

You could fix both with something like:

  git ls-files -z | perl -0ne '
    push @{$h{lc($_)}}, $_;
    END {
      print join("\n", @{$h{$_}}) . "\n\n"
        foreach grep { @{$h{$_}} > 1 } keys(%h);
    }
  '

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 13:50 ` Jeff King
@ 2010-10-08 19:44   ` Dun Peal
  2010-10-08 19:51     ` Jeff King
  2010-10-08 19:56     ` Jonathan Nieder
  0 siblings, 2 replies; 14+ messages in thread
From: Dun Peal @ 2010-10-08 19:44 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, Oct 8, 2010 at 8:50 AM, Jeff King <peff@peff.net> wrote:
> You can do:
>
>  git ls-files | tr A-Z a-z | sort | uniq -d

Thanks, but the main issue is that this is a very large repository
with tens of thousands of paths (files and directories).

git ls-files thus takes a long time, almost a second. Since this is a
commit-heavy repo, I'd rather avoid that overhead.

Incidentally, there's an SVN hook that does the exact same thing that
I want to do:

http://svn.apache.org/repos/asf/subversion/trunk/contrib/hook-scripts/case-insensitive.py

D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 19:44   ` Dun Peal
@ 2010-10-08 19:51     ` Jeff King
  2010-10-08 19:57       ` Dun Peal
  2010-10-08 19:56     ` Jonathan Nieder
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2010-10-08 19:51 UTC (permalink / raw)
  To: Dun Peal; +Cc: git

On Fri, Oct 08, 2010 at 02:44:05PM -0500, Dun Peal wrote:

> On Fri, Oct 8, 2010 at 8:50 AM, Jeff King <peff@peff.net> wrote:
> > You can do:
> >
> >  git ls-files | tr A-Z a-z | sort | uniq -d
> 
> Thanks, but the main issue is that this is a very large repository
> with tens of thousands of paths (files and directories).
> 
> git ls-files thus takes a long time, almost a second. Since this is a
> commit-heavy repo, I'd rather avoid that overhead.

I don't think you will find a solution that is faster. ls-files is just
reading the list of files from git's index file. I'm not seeing anything
near that long to do an ls-files on my machine:

  $ cd linux-2.6
  $ git ls-files | wc -l
  34296
  $ time git ls-files >/dev/null
  real    0m0.034s
  user    0m0.028s
  sys     0m0.012s

One thing to consider, though, is if this is a hook running on the
server, you probably don't want to look at the index. You probably want
to look for duplicates in one tree entry (fed to the hook). So you would
be using git ls-tree, which probably is a bit slower.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 19:44   ` Dun Peal
  2010-10-08 19:51     ` Jeff King
@ 2010-10-08 19:56     ` Jonathan Nieder
  1 sibling, 0 replies; 14+ messages in thread
From: Jonathan Nieder @ 2010-10-08 19:56 UTC (permalink / raw)
  To: Dun Peal; +Cc: Jeff King, git

Dun Peal wrote:

> git ls-files thus takes a long time, almost a second. Since this is a
> commit-heavy repo, I'd rather avoid that overhead.
> 
> Incidentally, there's an SVN hook that does the exact same thing that
> I want to do:
> 
> http://svn.apache.org/repos/asf/subversion/trunk/contrib/hook-scripts/case-insensitive.py

Well, can't you do the same thing that hook does?

1. Decide on the desired semantics.

Should a broken push of multiple branches be entirely rejected, or
just the broken branches?  The answer determines whether you should
be using a pre-receive or an update hook; see githooks(5).

2. Get a list of added files with git-diff-tree(1) --diff-filter.
3. Break file names into directory + basename.
4. For each directory with new files or subdirectories:
    - List its children in the new version with git-ls-tree(1)
    - Canonicalize path names
    - Find clashes

If this should be general-purpose, take care to handle:

 - new branches (<old-value> = 00000000...)
 - new subdirectory whose name clashes with an existing file
 - filename clash within a new subdirectory

Good luck,
Jonathan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 19:51     ` Jeff King
@ 2010-10-08 19:57       ` Dun Peal
  2010-10-08 20:06         ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Dun Peal @ 2010-10-08 19:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, Oct 8, 2010 at 2:51 PM, Jeff King <peff@peff.net> wrote:
> One thing to consider, though, is if this is a hook running on the
> server, you probably don't want to look at the index. You probably want
> to look for duplicates in one tree entry (fed to the hook). So you would
> be using git ls-tree, which probably is a bit slower.

Thanks, but why is that?  Why can't I use ls-files, and must use use
ls-tree, which you say would be slower?

For extra details: is the central Git repository, running under
Gitorious, on a fairly powerful server. It's a bare repo, naturally.

D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 19:57       ` Dun Peal
@ 2010-10-08 20:06         ` Jeff King
  2010-10-08 22:57           ` Dun Peal
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2010-10-08 20:06 UTC (permalink / raw)
  To: Dun Peal; +Cc: git

On Fri, Oct 08, 2010 at 02:57:49PM -0500, Dun Peal wrote:

> On Fri, Oct 8, 2010 at 2:51 PM, Jeff King <peff@peff.net> wrote:
> > One thing to consider, though, is if this is a hook running on the
> > server, you probably don't want to look at the index. You probably want
> > to look for duplicates in one tree entry (fed to the hook). So you would
> > be using git ls-tree, which probably is a bit slower.
> 
> Thanks, but why is that?  Why can't I use ls-files, and must use use
> ls-tree, which you say would be slower?

For two reasons:

  1. Bare repos generally don't _have_ an index, as it is about
     maintaining the state of the working tree.

  2. Even if you did have an index, it would presumably represent the
     contents of HEAD. But if you are feeding a commit to a hook, then
     that hook will get some sha1 of the to-be-pushed commit. So you
     need to look at the paths that are in that hook.

Re-reading your original message, I have a few more thoughts.

One is that you don't need to do this per-commit. You probably want to
do it per-updated-ref, each of which may be pushing many commits. And
then you either reject the new ref value or not.

Also, you could try not looking at the whole tree by doing something
like:

  git diff-tree --diff-filter=A --name-only $old $new

and then only considering duplicates for newly introduced files. But
that means for each introduced file, you need to enumerate the tree to
find case-sensitive matches. You can avoid looking at the whole tree
only be manually expanding each level (e.g., you see that foo/bar is
new. So you look at the root tree, looking for "foo" or case-insensitive
equivalents. For each one you find, you look further down for "bar" or a
case-insensitive equivalent). But that means many git ls-tree calls. So
I don't think it buys you anything in practice.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 20:06         ` Jeff King
@ 2010-10-08 22:57           ` Dun Peal
  2010-10-09  8:47             ` Jakub Narebski
  2010-10-11  3:07             ` Jeff King
  0 siblings, 2 replies; 14+ messages in thread
From: Dun Peal @ 2010-10-08 22:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, Oct 8, 2010 at 3:06 PM, Jeff King <peff@peff.net> wrote:
> Re-reading your original message, I have a few more thoughts.
>
> One is that you don't need to do this per-commit. You probably want to
> do it per-updated-ref, each of which may be pushing many commits. And
> then you either reject the new ref value or not.

I think I do, actually, because let's say the developer pushes two
commits, 1<-2. Suppose commit 1 violates the rule, but commit 2
reverts the violation. One might think that we don't care, since the
head will now be on 2, which is a correct state. But in fact we do,
because this is Git, and anyone may branch of from 1 in the future,
and voila we have a head in an incorrect state.

> Also, you could try not looking at the whole tree by [...]
> only be manually expanding each level [...]
> But that means many git ls-tree calls.

Yeah, that's a pretty good idea, if not for the many ls-tree calls.
With their overhead, I strongly suspect it may be slower than the
solution you seem to propose, which is:

git ls-tree -r <commit>

which should give the full list of all paths in a commit, upon which I
can decide to accept or reject.

Thanks, D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 22:57           ` Dun Peal
@ 2010-10-09  8:47             ` Jakub Narebski
  2010-10-09 22:00               ` Dun Peal
  2010-10-11  3:07             ` Jeff King
  1 sibling, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2010-10-09  8:47 UTC (permalink / raw)
  To: Dun Peal; +Cc: Jeff King, git

Dun Peal <dunpealer@gmail.com> writes:
> On Fri, Oct 8, 2010 at 3:06 PM, Jeff King <peff@peff.net> wrote:

> > Re-reading your original message, I have a few more thoughts.
> >
> > One is that you don't need to do this per-commit. You probably want to
> > do it per-updated-ref, each of which may be pushing many commits. And
> > then you either reject the new ref value or not.
> 
> I think I do, actually, because let's say the developer pushes two
> commits, 1<-2. Suppose commit 1 violates the rule, but commit 2
> reverts the violation. One might think that we don't care, since the
> head will now be on 2, which is a correct state. But in fact we do,
> because this is Git, and anyone may branch of from 1 in the future,
> and voila we have a head in an incorrect state.

Sidenote: why not detect violation earlier, by having pre-commit hook
in each developer repository check for such violation?  It is not as
time sensitive on the local side (when creating a commit, you have to
take some time to write commit message anyway).

See how example pre-commit hook check for "non-ascii" file names.
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-09  8:47             ` Jakub Narebski
@ 2010-10-09 22:00               ` Dun Peal
  2010-10-09 22:31                 ` Jakub Narebski
  0 siblings, 1 reply; 14+ messages in thread
From: Dun Peal @ 2010-10-09 22:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Jeff King, git

On Sat, Oct 9, 2010 at 3:47 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> Sidenote: why not detect violation earlier, by having pre-commit hook
> in each developer repository check for such violation?  It is not as
> time sensitive on the local side (when creating a commit, you have to
> take some time to write commit message anyway).

That's a solution, but as you surely know hooks are not automatically
cloned with the repository. There's no way to ensure that all of our
many users install that hook, and we can't allow an incorrect state on
the central node.

A pre-receive hook on that node seems like the only way to really ensure that.

D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-09 22:00               ` Dun Peal
@ 2010-10-09 22:31                 ` Jakub Narebski
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Narebski @ 2010-10-09 22:31 UTC (permalink / raw)
  To: Dun Peal; +Cc: Jeff King, git

On Sun, Oct 10, 2010, Dun Peal wrote:
> On Sat, Oct 9, 2010 at 3:47 AM, Jakub Narebski <jnareb@gmail.com> wrote:

> > Sidenote: why not detect violation earlier, by having pre-commit hook
> > in each developer repository check for such violation?  It is not as
> > time sensitive on the local side (when creating a commit, you have to
> > take some time to write commit message anyway).
> 
> That's a solution, but as you surely know hooks are not automatically
> cloned with the repository. There's no way to ensure that all of our
> many users install that hook, and we can't allow an incorrect state on
> the central node.
> 
> A pre-receive hook on that node seems like the only way to really ensure that.

Well, if you can control git installation on client side, you can
modify git templates (usually in /usr/share/git-core/templates) to
have appropriate pre-commit hok enabled in all repositories created
there... but I agree that it is not an universal solution.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-08 22:57           ` Dun Peal
  2010-10-09  8:47             ` Jakub Narebski
@ 2010-10-11  3:07             ` Jeff King
  2010-10-16 22:37               ` Dun Peal
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2010-10-11  3:07 UTC (permalink / raw)
  To: Dun Peal; +Cc: git

On Fri, Oct 08, 2010 at 05:57:16PM -0500, Dun Peal wrote:

> On Fri, Oct 8, 2010 at 3:06 PM, Jeff King <peff@peff.net> wrote:
> > Re-reading your original message, I have a few more thoughts.
> >
> > One is that you don't need to do this per-commit. You probably want to
> > do it per-updated-ref, each of which may be pushing many commits. And
> > then you either reject the new ref value or not.
> 
> I think I do, actually, because let's say the developer pushes two
> commits, 1<-2. Suppose commit 1 violates the rule, but commit 2
> reverts the violation. One might think that we don't care, since the
> head will now be on 2, which is a correct state. But in fact we do,
> because this is Git, and anyone may branch of from 1 in the future,
> and voila we have a head in an incorrect state.

Yeah, though it is not an especially likely state to branch from, since
you have to specify it manually. However, a much more likely scenario is
checkout out a past commit for testing, especially in bisection. So yes,
if you want to be thorough, you need to check every commit.

> Yeah, that's a pretty good idea, if not for the many ls-tree calls.
> With their overhead, I strongly suspect it may be slower than the
> solution you seem to propose, which is:
> 
> git ls-tree -r <commit>
> 
> which should give the full list of all paths in a commit, upon which I
> can decide to accept or reject.

Yeah, that is what I am proposing.

One other thing you could try is to "ls-tree -r" the known-good state of
the current HEAD at the beginning of the push, and then run "git log
-diff-filter=AD --name-status $old..$new". For each commit in the log
output, look for new entries that are in case-insensitive conflict with
the existing tree, and then update your tree state appropriately with
added and removed files. You only invoke two git commands, which saves
on invocation overhead, and you only ls-tree once per push, not per
commit. Git's internal diff shouldn't look at parts of the tree that
aren't relevant.

The downside is that the tree state you are keeping internally is not
entirely accurate. For example, when receiving a merge between two
parallel lines of development, you would process them linearly, when in
fact there are two simultaneous different states. So there is a case
where branch X removes "foo.txt" and branch Y adds "FOO.TXT", and then
they merge. It looks OK because linearly, they did not both exist at the
same time. But pre-merge, the commit in branch Y is broken.

So really the straightforward approach of checking the tree state for
each commit is probably simplest. If it's really too slow, you could try
jgit or linking against git itself, which would eliminate the external
process overhead.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-11  3:07             ` Jeff King
@ 2010-10-16 22:37               ` Dun Peal
  2010-10-17  4:25                 ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Dun Peal @ 2010-10-16 22:37 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Thanks a lot everyone, and especially Jeff.

Running:

    git ls-tree -r HEAD

produces almost 100k lines of output, representing 100k total file +
directory count, which is a pretty big tree. I hope and expect to be
able to split out some branches in the future, but even with
everything included, the operation still only takes ~0.5s real (user +
sys).

So I think even with our relatively high user count and repository
traffic, that cost per commit should not completely break our
workflow.

Thus, I'm going to implement the hook to run the above command and
analyze the output per commit. Assumption is that with the script
overhead, we should be able to finish in less than a second, which is
reasonable latency.

I'll report back if there are any unexpected issues, or this doesn't
work as well as currently expected.

Thanks, D

On Sun, Oct 10, 2010 at 10:07 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Oct 08, 2010 at 05:57:16PM -0500, Dun Peal wrote:
>
>> On Fri, Oct 8, 2010 at 3:06 PM, Jeff King <peff@peff.net> wrote:
>> > Re-reading your original message, I have a few more thoughts.
>> >
>> > One is that you don't need to do this per-commit. You probably want to
>> > do it per-updated-ref, each of which may be pushing many commits. And
>> > then you either reject the new ref value or not.
>>
>> I think I do, actually, because let's say the developer pushes two
>> commits, 1<-2. Suppose commit 1 violates the rule, but commit 2
>> reverts the violation. One might think that we don't care, since the
>> head will now be on 2, which is a correct state. But in fact we do,
>> because this is Git, and anyone may branch of from 1 in the future,
>> and voila we have a head in an incorrect state.
>
> Yeah, though it is not an especially likely state to branch from, since
> you have to specify it manually. However, a much more likely scenario is
> checkout out a past commit for testing, especially in bisection. So yes,
> if you want to be thorough, you need to check every commit.
>
>> Yeah, that's a pretty good idea, if not for the many ls-tree calls.
>> With their overhead, I strongly suspect it may be slower than the
>> solution you seem to propose, which is:
>>
>> git ls-tree -r <commit>
>>
>> which should give the full list of all paths in a commit, upon which I
>> can decide to accept or reject.
>
> Yeah, that is what I am proposing.
>
> One other thing you could try is to "ls-tree -r" the known-good state of
> the current HEAD at the beginning of the push, and then run "git log
> -diff-filter=AD --name-status $old..$new". For each commit in the log
> output, look for new entries that are in case-insensitive conflict with
> the existing tree, and then update your tree state appropriately with
> added and removed files. You only invoke two git commands, which saves
> on invocation overhead, and you only ls-tree once per push, not per
> commit. Git's internal diff shouldn't look at parts of the tree that
> aren't relevant.
>
> The downside is that the tree state you are keeping internally is not
> entirely accurate. For example, when receiving a merge between two
> parallel lines of development, you would process them linearly, when in
> fact there are two simultaneous different states. So there is a case
> where branch X removes "foo.txt" and branch Y adds "FOO.TXT", and then
> they merge. It looks OK because linearly, they did not both exist at the
> same time. But pre-merge, the commit in branch Y is broken.
>
> So really the straightforward approach of checking the tree state for
> each commit is probably simplest. If it's really too slow, you could try
> jgit or linking against git itself, which would eliminate the external
> process overhead.
>
> -Peff
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Efficiently detecting paths that differ from each other only in case
  2010-10-16 22:37               ` Dun Peal
@ 2010-10-17  4:25                 ` Jeff King
  0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2010-10-17  4:25 UTC (permalink / raw)
  To: Dun Peal; +Cc: git

On Sat, Oct 16, 2010 at 05:37:53PM -0500, Dun Peal wrote:

> Running:
> 
>     git ls-tree -r HEAD
> 
> produces almost 100k lines of output, representing 100k total file +
> directory count, which is a pretty big tree. I hope and expect to be
> able to split out some branches in the future, but even with
> everything included, the operation still only takes ~0.5s real (user +
> sys).

Out of curiosity, is your bare repo fully packed? My linux-2.6 repo is
only ~35000 entries, but it can ls-tree -r with a warm cache in about
.09s. If it scaled linearly with the number of files, that would be
almost twice as fast as yours.

The distribution of files within your directories can impact exactly how
many trees you have, and thus the lookup performance. Or maybe your
machine is simply a little bit slower. But since you seem
performance-conscious, it might be worth double-checking.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-10-17  4:25 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-08  6:13 Efficiently detecting paths that differ from each other only in case Dun Peal
2010-10-08 13:50 ` Jeff King
2010-10-08 19:44   ` Dun Peal
2010-10-08 19:51     ` Jeff King
2010-10-08 19:57       ` Dun Peal
2010-10-08 20:06         ` Jeff King
2010-10-08 22:57           ` Dun Peal
2010-10-09  8:47             ` Jakub Narebski
2010-10-09 22:00               ` Dun Peal
2010-10-09 22:31                 ` Jakub Narebski
2010-10-11  3:07             ` Jeff King
2010-10-16 22:37               ` Dun Peal
2010-10-17  4:25                 ` Jeff King
2010-10-08 19:56     ` Jonathan Nieder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).