Index files autocompletion too slow in big repositories (w / suggestion for improvement)

git@vger.kernel.org mailing list mirror (one of many)

* Index files autocompletion too slow in big repositories (w / suggestion for improvement)
@ 2017-04-14 20:06 Carlos Pita
  2017-04-14 22:08 ` Carlos Pita
  0 siblings, 1 reply; 10+ messages in thread
From: Carlos Pita @ 2017-04-14 20:06 UTC (permalink / raw)
  To: git

Hi all,

I'm currently using git annex to manage my entire file collection
(including tons of music and books) and I noticed how slow
autocompletion has become for files in the index (say for git add).
The main offender is a while-read-case-echo bash loop in
__git_index_files that can be readily substituted with a much faster
sed invocation, although I guess you didn't want the sed dependency in
the first place. Anyway, here is my benchmark:

__git_index_files ()
{
    local dir="$(__gitdir)" root="${2-.}" file;
    if [ -d "$dir" ]; then
        __git_ls_files_helper "$root" "$1" | while read -r file; do
            case "$file" in
                ?*/*)
                    echo "${file%%/*}"
                ;;
                *)
                    echo "$file"
                ;;
            esac;
        done | sort | uniq;
    fi
}

time __git_index_files > /dev/null

__git_index_files ()
{
    local dir="$(__gitdir)" root="${2-.}" file;
    if [ -d "$dir" ]; then
        __git_ls_files_helper "$root" "$1" | \
            sed -r 's@^"?([^/]+)/.*$@\1@' | sort | uniq
    fi
}

time __git_index_files > /dev/null

Original while-read loop:

real    0m0.830s
user    0m0.597s
sys     0m0.310s

sed version:

real    0m0.345s
user    0m0.357s
sys     0m0.000s

Notice I'm also excluding the beginning double quote that appears in
escaped path names.

Best regards
--
Carlos


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-14 20:06 Index files autocompletion too slow in big repositories (w / suggestion for improvement) Carlos Pita
@ 2017-04-14 22:08 ` Carlos Pita
  2017-04-14 22:33   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 10+ messages in thread
From: Carlos Pita @ 2017-04-14 22:08 UTC (permalink / raw)
  To: “git@vger.kernel.org”

This is much faster (below 0.1s):

__git_index_files ()
{
    local dir="$(__gitdir)" root="${2-.}" file;
    if [ -d "$dir" ]; then
        __git_ls_files_helper "$root" "$1" | \
            sed -r 's@/.*@@' | uniq | sort | uniq
    fi
}

time __git_index_files

real    0m0.075s
user    0m0.083s
sys    0m0.010s

Most of the improvement is due to the simpler, non-grouping regex.
Since I expect most of the common prefixes to arrive consecutively,
running uniq before sort also improves things a bit. I'm not removing
leading double quotes anymore (this isn't being done by the current
version, anyway) but this doesn't seem to hurt.

Despite the dependence on sed, this is ten times faster than the
original, so an option to enable fast index completion, or something
like that, might be desirable.
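
To illustrate why the leading uniq helps, here is a minimal sketch that
counts how many lines reach sort with and without it. The sample_paths
helper and its directory names are made up for illustration; they stand in
for the output of __git_ls_files_helper, assuming it arrives grouped by
directory.

```shell
# Made-up input: 1000 files in each of three directories, grouped by
# directory the way sorted "git ls-files" output would be.
sample_paths ()
{
    for d in books music photos; do
        i=1
        while [ "$i" -le 1000 ]; do
            echo "$d/file$i"
            i=$((i + 1))
        done
    done
}

# Number of lines that reach sort without the leading uniq.
sample_paths | sed 's@/.*@@' | wc -l

# Number of lines that reach sort with the leading uniq.
sample_paths | sed 's@/.*@@' | uniq | wc -l
```

With this input the first count is 3000 and the second is 3, so sort has
almost nothing left to do. The trailing uniq is still needed in case the
same prefix shows up in non-adjacent runs.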

Best regards
--
Carlos


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-14 22:08 ` Carlos Pita
@ 2017-04-14 22:33   ` Ævar Arnfjörð Bjarmason
  2017-04-15  1:37     ` Jacob Keller
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-14 22:33 UTC (permalink / raw)
  To: Carlos Pita; +Cc: “git@vger.kernel.org”

On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
> This is much faster (below 0.1s):
>
> __git_index_files ()
> {
>     local dir="$(__gitdir)" root="${2-.}" file;
>     if [ -d "$dir" ]; then
>         __git_ls_files_helper "$root" "$1" | \
>             sed -r 's@/.*@@' | uniq | sort | uniq
>     fi
> }
>
> time __git_index_files
>
> real    0m0.075s
> user    0m0.083s
> sys    0m0.010s
>
> Most of the improvement is due to the simpler, non-grouping, regex.
> Since I expect most of the common prefixes to arrive consecutively,
> running uniq before sort also improves things a bit. I'm not removing
> leading double quotes anymore (this isn't being done by the current
> version, anyway) but this doesn't seem to hurt.
>
> Despite the dependence on sed this is ten times faster than the
> original, maybe an option to enable fast index completion or something
> like that might be desirable.
>
> Best regards

It's fine to depend on sed: these shell scripts are POSIX compatible,
and so is sed. We use sed in a lot of the built-in shell scripts.

I think you should submit this as a patch, see Documentation/SubmittingPatches.


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-14 22:33   ` Ævar Arnfjörð Bjarmason
@ 2017-04-15  1:37     ` Jacob Keller
  2017-04-15  7:52       ` Junio C Hamano
  2017-04-15 11:59     ` Johannes Sixt
  2017-04-15 12:30     ` Johannes Sixt
  2 siblings, 1 reply; 10+ messages in thread
From: Jacob Keller @ 2017-04-15  1:37 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlos Pita, “git@vger.kernel.org”

On Fri, Apr 14, 2017 at 3:33 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
>> This is much faster (below 0.1s):
>>
>> __git_index_files ()
>> {
>>     local dir="$(__gitdir)" root="${2-.}" file;
>>     if [ -d "$dir" ]; then
>>         __git_ls_files_helper "$root" "$1" | \
>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>     fi
>> }
>>
>> time __git_index_files
>>
>> real    0m0.075s
>> user    0m0.083s
>> sys    0m0.010s
>>
>> Most of the improvement is due to the simpler, non-grouping, regex.
>> Since I expect most of the common prefixes to arrive consecutively,
>> running uniq before sort also improves things a bit. I'm not removing
>> leading double quotes anymore (this isn't being done by the current
>> version, anyway) but this doesn't seem to hurt.
>>
>> Despite the dependence on sed this is ten times faster than the
>> original, maybe an option to enable fast index completion or something
>> like that might be desirable.
>>
>> Best regards
>
> It's fine to depend on sed, these shell-scripts are POSIX compatible,
> and so is sed, we use sed in a lot of the built-in shellscripts.
>
> I think you should submit this as a patch, see Documentation/SubmittingPatches.

Yeah, it should be fine to use sed.

Thanks,
Jake


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-15  1:37     ` Jacob Keller
@ 2017-04-15  7:52       ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2017-04-15  7:52 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Ævar Arnfjörð Bjarmason, Carlos Pita,
	“git@vger.kernel.org”

Jacob Keller <jacob.keller@gmail.com> writes:

> On Fri, Apr 14, 2017 at 3:33 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
>>> This is much faster (below 0.1s):
>>>
>>> __git_index_files ()
>>> {
>>>     local dir="$(__gitdir)" root="${2-.}" file;
>>>     if [ -d "$dir" ]; then
>>>         __git_ls_files_helper "$root" "$1" | \
>>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>>     fi
>>> }
>>>
>>> time __git_index_files
>>>
>>> real    0m0.075s
>>> user    0m0.083s
>>> sys    0m0.010s
>>>
>>> Most of the improvement is due to the simpler, non-grouping, regex.
>>> Since I expect most of the common prefixes to arrive consecutively,
>>> running uniq before sort also improves things a bit. I'm not removing
>>> leading double quotes anymore (this isn't being done by the current
>>> version, anyway) but this doesn't seem to hurt.
>>>
>>> Despite the dependence on sed this is ten times faster than the
>>> original, maybe an option to enable fast index completion or something
>>> like that might be desirable.
>>>
>>> Best regards
>>
>> It's fine to depend on sed, these shell-scripts are POSIX compatible,
>> and so is sed, we use sed in a lot of the built-in shellscripts.
>>
>> I think you should submit this as a patch, see Documentation/SubmittingPatches.
>
> Yea it should be fine to use sed.

That is fine as long as the use of "sed" is in line with POSIX.1. I do not
think you need the non-portable "-r" merely to strip out everything that
follows the first slash, so perhaps apply "s|-r|-e|" to the above. Also, do
not write a backslash after the pipe at the end of the line: the shell
knows you haven't finished talking to it yet if you end a line with a
pipe, so there is no need for the backslash. With those two changes,
you'd be golden.
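
Applied to the tail of the proposed function, the two suggestions might
look like the sketch below. The function name strip_after_first_slash and
the sample paths are assumptions for illustration only; in the completion
script the input would come from __git_ls_files_helper.

```shell
# Hypothetical stand-in for the proposed pipeline, reading paths on stdin.
# "-e" replaces the non-portable "-r", and each line ends with "|", so no
# trailing backslash is needed to continue the command.
strip_after_first_slash ()
{
    sed -e 's@/.*@@' |
    uniq |
    sort |
    uniq
}

printf '%s\n' Documentation/git.txt Documentation/config.txt Makefile |
strip_after_first_slash
```

The pattern uses no grouping or alternation, so it means the same thing as
a basic regular expression and no extended-syntax flag is required.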


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-14 22:33   ` Ævar Arnfjörð Bjarmason
  2017-04-15  1:37     ` Jacob Keller
@ 2017-04-15 11:59     ` Johannes Sixt
  2017-04-16  0:31       ` Jacob Keller
  2017-04-17  4:05       ` Junio C Hamano
  2017-04-15 12:30     ` Johannes Sixt
  2 siblings, 2 replies; 10+ messages in thread
From: Johannes Sixt @ 2017-04-15 11:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Carlos Pita
  Cc: “git@vger.kernel.org”, SZEDER Gábor

Cc Gábor.

Am 15.04.2017 um 00:33 schrieb Ævar Arnfjörð Bjarmason:
> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
>> This is much faster (below 0.1s):
>>
>> __git_index_files ()
>> {
>>     local dir="$(__gitdir)" root="${2-.}" file;
>>     if [ -d "$dir" ]; then
>>         __git_ls_files_helper "$root" "$1" | \
>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>     fi
>> }
>>
>> time __git_index_files
>>
>> real    0m0.075s
>> user    0m0.083s
>> sys    0m0.010s
>>
>> Most of the improvement is due to the simpler, non-grouping, regex.
>> Since I expect most of the common prefixes to arrive consecutively,
>> running uniq before sort also improves things a bit. I'm not removing
>> leading double quotes anymore (this isn't being done by the current
>> version, anyway) but this doesn't seem to hurt.
>>
>> Despite the dependence on sed this is ten times faster than the
>> original, maybe an option to enable fast index completion or something
>> like that might be desirable.
>
> It's fine to depend on sed, these shell-scripts are POSIX compatible,
> and so is sed, we use sed in a lot of the built-in shellscripts.

This is about command line completion. We go a long way to avoid forking
processes there. What is 10x faster on Linux despite forking a
process may not be so on Windows.

(I'm not using bash command line completion on Windows, so I can't tell 
what the effect of your suggested change is on Windows. I hope Gábor can 
comment on it.)

-- Hannes



* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-14 22:33   ` Ævar Arnfjörð Bjarmason
  2017-04-15  1:37     ` Jacob Keller
  2017-04-15 11:59     ` Johannes Sixt
@ 2017-04-15 12:30     ` Johannes Sixt
  2 siblings, 0 replies; 10+ messages in thread
From: Johannes Sixt @ 2017-04-15 12:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Carlos Pita
  Cc: “git@vger.kernel.org”, SZEDER Gábor

Cc Gábor, resent with a working email address (hopefully); please follow
up on this mail.

Am 15.04.2017 um 00:33 schrieb Ævar Arnfjörð Bjarmason:
> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
>> This is much faster (below 0.1s):
>>
>> __git_index_files ()
>> {
>>     local dir="$(__gitdir)" root="${2-.}" file;
>>     if [ -d "$dir" ]; then
>>         __git_ls_files_helper "$root" "$1" | \
>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>     fi
>> }
>>
>> time __git_index_files
>>
>> real    0m0.075s
>> user    0m0.083s
>> sys    0m0.010s
>>
>> Most of the improvement is due to the simpler, non-grouping, regex.
>> Since I expect most of the common prefixes to arrive consecutively,
>> running uniq before sort also improves things a bit. I'm not removing
>> leading double quotes anymore (this isn't being done by the current
>> version, anyway) but this doesn't seem to hurt.
>>
>> Despite the dependence on sed this is ten times faster than the
>> original, maybe an option to enable fast index completion or something
>> like that might be desirable.
>
> It's fine to depend on sed, these shell-scripts are POSIX compatible,
> and so is sed, we use sed in a lot of the built-in shellscripts.

This is about command line completion. We go a long way to avoid forking
processes there. What is 10x faster on Linux despite forking a
process may not be so on Windows.

(I'm not using bash command line completion on Windows, so I can't tell 
what the effect of your suggested change is on Windows. I hope Gábor can 
comment on it.)

-- Hannes


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-15 11:59     ` Johannes Sixt
@ 2017-04-16  0:31       ` Jacob Keller
  2017-04-17  4:05       ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Jacob Keller @ 2017-04-16  0:31 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Ævar Arnfjörð Bjarmason, Carlos Pita,
	“git@vger.kernel.org”, SZEDER Gábor

On Sat, Apr 15, 2017 at 4:59 AM, Johannes Sixt <j6t@kdbg.org> wrote:
> Cc Gábor.
>
> Am 15.04.2017 um 00:33 schrieb Ævar Arnfjörð Bjarmason:
>>
>> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com>
>> wrote:
>>>
>>> This is much faster (below 0.1s):
>>>
>>> __git_index_files ()
>>> {
>>>     local dir="$(__gitdir)" root="${2-.}" file;
>>>     if [ -d "$dir" ]; then
>>>         __git_ls_files_helper "$root" "$1" | \
>>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>>     fi
>>> }
>>>
>>> time __git_index_files
>>>
>>> real    0m0.075s
>>> user    0m0.083s
>>> sys    0m0.010s
>>>
>>> Most of the improvement is due to the simpler, non-grouping, regex.
>>> Since I expect most of the common prefixes to arrive consecutively,
>>> running uniq before sort also improves things a bit. I'm not removing
>>> leading double quotes anymore (this isn't being done by the current
>>> version, anyway) but this doesn't seem to hurt.
>>>
>>> Despite the dependence on sed this is ten times faster than the
>>> original, maybe an option to enable fast index completion or something
>>> like that might be desirable.
>>
>>
>> It's fine to depend on sed, these shell-scripts are POSIX compatible,
>> and so is sed, we use sed in a lot of the built-in shellscripts.
>
>
> This is about command line completion. We go a long way to avoid forking
> processes there. What is 10x faster on Linux despite of forking a process
> may not be so on Windows.
>
> (I'm not using bash command line completion on Windows, so I can't tell what
> the effect of your suggested change is on Windows. I hope Gábor can comment
> on it.)
>
> -- Hannes
>

In cases like this, might it be worth splitting the implementation
somehow, so that Linux can use what is fastest there and Windows can
continue using what's best for it? The advantage on Linux is pretty
significant.
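
One way such a split could be sketched is to pick the implementation once,
when the script is loaded. Everything here is an assumption for
illustration: the function names are made up, and the test relies on
bash's OSTYPE variable reporting "msys" or "cygwin" on Windows ports, which
avoids forking a process just to detect the platform.

```shell
# Hypothetical helpers: both read paths on stdin and print the first
# path component of each line.
first_components_loop ()
{
    while read -r file; do
        echo "${file%%/*}"
    done
}

first_components_sed ()
{
    sed -e 's@/.*@@'
}

# Choose once at load time; OSTYPE is set by bash and may be unset in
# other shells, in which case the sed variant is used.
case "${OSTYPE-}" in
msys*|cygwin*)
    # Process creation is comparatively slow here; stay in the shell.
    first_components () { first_components_loop; }
    ;;
*)
    first_components () { first_components_sed; }
    ;;
esac

printf '%s\n' src/main.c src/util.c README | first_components | sort | uniq
```

Whether the platform is the right thing to switch on is a separate
question; the number of paths matters as well.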

Thanks,
Jake


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-15 11:59     ` Johannes Sixt
  2017-04-16  0:31       ` Jacob Keller
@ 2017-04-17  4:05       ` Junio C Hamano
  2017-04-17  8:03         ` Johannes Sixt
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2017-04-17  4:05 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Ævar Arnfjörð Bjarmason, Carlos Pita,
	“git@vger.kernel.org”, SZEDER Gábor

Johannes Sixt <j6t@kdbg.org> writes:

> Cc Gábor.
>
> Am 15.04.2017 um 00:33 schrieb Ævar Arnfjörð Bjarmason:
>> On Sat, Apr 15, 2017 at 12:08 AM, Carlos Pita <carlosjosepita@gmail.com> wrote:
>>> This is much faster (below 0.1s):
>>>
>>> __git_index_files ()
>>> {
>>>     local dir="$(__gitdir)" root="${2-.}" file;
>>>     if [ -d "$dir" ]; then
>>>         __git_ls_files_helper "$root" "$1" | \
>>>             sed -r 's@/.*@@' | uniq | sort | uniq
>>>     fi
>>> }
>>>
>>> time __git_index_files
>>>
>>> real    0m0.075s
>>> user    0m0.083s
>>> sys    0m0.010s
>>>
>>> Most of the improvement is due to the simpler, non-grouping, regex.
>>> Since I expect most of the common prefixes to arrive consecutively,
>>> running uniq before sort also improves things a bit. I'm not removing
>>> leading double quotes anymore (this isn't being done by the current
>>> version, anyway) but this doesn't seem to hurt.
>>>
>>> Despite the dependence on sed this is ten times faster than the
>>> original, maybe an option to enable fast index completion or something
>>> like that might be desirable.
>>
>> It's fine to depend on sed, these shell-scripts are POSIX compatible,
>> and so is sed, we use sed in a lot of the built-in shellscripts.
>
> This is about command line completion. We go a long way to avoid
> forking processes there. What is 10x faster on Linux despite of
> forking a process may not be so on Windows.

Doesn't this depend on how many paths there are?  If there are only
a few paths, the loop in shell would beat a pipe into sed even on
Linux, I suspect, and if there are tons of paths, at some number,
loop in shell would become slower than a single spawning of sed on
platforms with slower fork, no?
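
A rough way to look for that crossover on a given machine is to time both
approaches on generated input of different sizes. This is only a sketch:
the helper names and the generated paths are made up, the loop is a
simplified form of the one in __git_index_files, and it assumes bash (for
the "time" keyword) and mktemp.

```shell
#!/bin/bash
# Print $1 made-up paths spread over ten directories.
gen_paths ()
{
    local n=$1 i=0
    while [ "$i" -lt "$n" ]; do
        echo "dir$((i % 10))/file$i"
        i=$((i + 1))
    done
}

# Pure-shell variant: no extra process besides sort and uniq.
with_loop ()
{
    local file
    while read -r file; do
        echo "${file%%/*}"
    done | sort | uniq
}

# sed variant: one additional process, but no per-line shell work.
with_sed ()
{
    sed -e 's@/.*@@' | sort | uniq
}

tmp=$(mktemp)
for n in 10 1000 20000; do
    gen_paths "$n" >"$tmp"
    echo "== $n paths, shell loop"
    time with_loop <"$tmp" >/dev/null
    echo "== $n paths, sed"
    time with_sed <"$tmp" >/dev/null
done
rm -f "$tmp"
```

The timings are printed on standard error and will differ between
machines and platforms, so they need to be collected on the systems one
cares about.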


* Re: Index files autocompletion too slow in big repositories (w / suggestion for improvement)
  2017-04-17  4:05       ` Junio C Hamano
@ 2017-04-17  8:03         ` Johannes Sixt
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Sixt @ 2017-04-17  8:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Carlos Pita,
	“git@vger.kernel.org”, SZEDER Gábor

Am 17.04.2017 um 06:05 schrieb Junio C Hamano:
> Johannes Sixt <j6t@kdbg.org> writes:
>> This is about command line completion. We go a long way to avoid
>> forking processes there. What is 10x faster on Linux despite of
>> forking a process may not be so on Windows.
>
> Doesn't this depend on how many paths there are?  If there are only
> a few paths, the loop in shell would beat a pipe into sed even on
> Linux, I suspect, and if there are tons of paths, at some number,
> loop in shell would become slower than a single spawning of sed on
> platforms with slower fork, no?

Absolutely. I just want to make sure a suggested change takes into 
account the situation on Windows, not only the "YESSSS!" and "VERY 
WELL!" votes of Linux users ;)

-- Hannes


