git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: John Cai <johncai86@gmail.com>
To: phillip.wood@dunelm.org.uk
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"John Cai via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org, "Patrick Steinhardt" <ps@pks.im>
Subject: Re: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm
Date: Tue, 07 Feb 2023 12:00:49 -0500	[thread overview]
Message-ID: <65129323-326F-4E4A-B6F8-06DC3BBE7B58@gmail.com> (raw)
In-Reply-To: <d18a5c32-2f15-93ad-ccbf-e8f048edb311@dunelm.org.uk>

Hi Phillip,

On 7 Feb 2023, at 9:55, Phillip Wood wrote:

> Hi John
>
> On 06/02/2023 20:37, John Cai wrote:
>> On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:
>>> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>>> For some non-nitpicking, I do worry about exposing this as a DoS vector,
>>> e.g. here's a diff between two distant points in git.git with the
>>> various algorithms:
>>>
>>> 	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
>>> 	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
>>> 	
>>> 	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
>>> 	
>>> 	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
>>> 	
>>> 	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
>>> 	
>>> 	Summary
>>> 	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
>>> 	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
>>> 	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
>>> 	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
>>
>> Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
>> already an attack vector right?
>
> .gitconfig is under the user's control though whereas .gitattributes is attacker controlled if one clones a malicious repository. Having said the worst results above are for the historgram algorithm that merge-ort uses internally and no one has complained about ort's performance.
>
>> I see how this would be adding another one.
>>
>> That being said, here's a separate issue. I benchmarked the usage of
>> .gitattributes as introduced in this patch series, and indeed it does look like
>> there is additional latency:
>>
>> $ echo "* diff-algorithm=patience >> .gitattributes
>> $ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'                      ✭
>> Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
>>    Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
>>    Range (min … max):   764.1 ms … 1029.3 ms    5 runs
>>
>> $ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'                                                ✭
>> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
>>    Time (mean ± σ):      2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
>>    Range (min … max):    1.883 s …  2.795 s    5 runs
>>
>> and I imagine the latency scales with the size of .gitattributes. Although I'm
>> not familiar with other parts of the codebase and how it deals with the latency
>> introduced by reading attributes files.
>
> Ouch! Thanks for benchmarking that is a suspiciously large slow down. I've a feeling the attributes code parses all the .gitattribute files from scratch for each path that's queried, so there may be scope for making it a bit smarter. I see some slow down but no where near as much

Yes, definitely worth looking into how this can be sped up.

>
>
> $ hyperfine -r 3 'bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0'
> Benchmark 1: bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0
>   Time (mean ± σ):      1.996 s ±  0.008 s    [User: 1.706 s, System: 0.286 s]
>   Range (min … max):    1.989 s …  2.004 s    3 runs
>
> $ hyperfine -r 5 'bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0'
> Benchmark 1: bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0
>   Time (mean ± σ):      2.238 s ±  0.037 s    [User: 1.880 s, System: 0.350 s]
>   Range (min … max):    2.216 s …  2.303 s    5 runs
>
>
> Perhaps I'm over simplifying but having read the issue you linked to I couldn't help feeling that the majority of users might be satisfied by just changing gitlab to use the patience algorithm when generating diffs.

Right, I recognize this is a judgment call that may be best left up to the list.

We don't have a way in GitLab to change the diff algorithm currently. Of course
that can be implemented outside of Git, but having it as part of Git may have a
wider benefit for users who would appreciate the convenience of automatically
having certain files use one diff algorithm and other files use another, without
having to pass in the diff algorithm through the command line each time.

>
> The idea to use .gitattributes seems to have come from Patrick rather than a user request.
>
> This is slightly off topic but one thing I'd really like is a way to tell diff use automatically use --diff-words on some files (e.g. Documentation/*)

I think it's a bit similar to the spirit of this desired feature.

thanks
John

>
> Best Wishes
>
> Phillip
>
>>> Now, all of those are very slow overall, but some much more than
>>> others. I seem to recall that the non-default ones also had some
>>> pathological cases.
>>>
>>> Another thing to think about is that we've so far considered the diff
>>> algorithm to be purely about presentation, with some notable exceptions
>>> such as "patch-id".
>>>
>>> I've advocated for us getting to the point of having an in-repo
>>> .gitconfig or .gitattributes before with a whitelist of settings like
>>> diff.context for certain paths, or a diff.orderFile.
>>>
>>> But those seem easy to promise future behavior for, v.s. an entire diff
>>> algorithm (which we of course had before, but now we'd have it in
>>> repository data).
>>>
>>> Maybe that's not a distinction worth worrying about, just putting that
>>> out there.
>>>
>>> I think if others are concerned about the above something that would
>>> neatly side-step those is to have it opt-in via the .git/config somehow,
>>> similar to e.g. how you can commit *.gpg content, put this in
>>> .gitattributes:
>>>
>>> 	*.gpg diff=gpg
>>>
>>> But not have it do anything until this is in the repo's .git/config (or
>>> similar):
>>>
>>> 	[diff "gpg"]
>>>          	textconv = gpg --no-tty --decrypt
>>>
>>> For that you could still keep the exact .gitattributes format you have
>>> here, i.e.:
>>>
>>> 	file* diff-algorithm=$STRATEGY
>>>
>>> But we to pick it up we'd need either:
>>>
>>> 	[diff-algorithm]
>>>          	histogram = myers
>>>
>>> Or:
>>>
>>> 	[diff-algorithm "histogram"]
>>>          	allow = true
>>
>> This would help address slowness from the diff algorithm itself. I'm not opposed
>> to adding this config if this attack vector is concerning to people.
>>
>> However, it wouldn't help address the additional latency of scanning
>> .gitattributes to find the diff algorithm.
>>
>> Would a separate config to allow gitattributes be helpful here?
>>
>> [diff-algorithm]
>> 	attributes = true
>>
>>>
>>> The former form being one that would allow you to map the .gitattributes
>>> of the repo (but maybe that would be redundant to
>>> .git/info/attributes)...

  reply	other threads:[~2023-02-07 17:00 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-05  3:46 [PATCH 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-05  3:46 ` [PATCH 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-06 16:20   ` Phillip Wood
2023-02-05  3:46 ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-05 17:50   ` Eric Sunshine
2023-02-06 13:10     ` John Cai
2023-02-06 16:27   ` Phillip Wood
2023-02-06 18:14     ` Eric Sunshine
2023-02-06 19:50     ` John Cai
2023-02-09  8:26       ` Elijah Newren
2023-02-09 10:31         ` "bad" diffs (was: [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm) Ævar Arnfjörð Bjarmason
2023-02-09 16:37         ` [PATCH 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai
2023-02-06 16:39   ` Ævar Arnfjörð Bjarmason
2023-02-06 20:37     ` John Cai
2023-02-07 14:55       ` Phillip Wood
2023-02-07 17:00         ` John Cai [this message]
2023-02-09  9:09           ` Elijah Newren
2023-02-09 14:44             ` Phillip Wood
2023-02-10  9:57               ` Elijah Newren
2023-02-11 17:39                 ` Phillip Wood
2023-02-11  1:59               ` Jeff King
2023-02-15  2:35                 ` Elijah Newren
2023-02-15  4:21                   ` Jeff King
2023-02-15  5:20                     ` Junio C Hamano
2023-02-15 14:44                 ` Phillip Wood
2023-02-15 15:00                   ` Jeff King
2023-02-07 17:27         ` Ævar Arnfjörð Bjarmason
2023-02-15 14:47           ` Phillip Wood
2023-02-09  8:44       ` Elijah Newren
2023-02-14 21:16         ` John Cai
2023-02-15  3:41           ` Elijah Newren
2023-02-09  7:50     ` Elijah Newren
2023-02-09  9:41       ` Ævar Arnfjörð Bjarmason
2023-02-11  2:04         ` Jeff King
2023-02-07 17:56   ` Jeff King
2023-02-07 20:18     ` Ævar Arnfjörð Bjarmason
2023-02-07 20:47       ` Junio C Hamano
2023-02-07 21:05         ` Ævar Arnfjörð Bjarmason
2023-02-07 21:28           ` Junio C Hamano
2023-02-07 21:44             ` Ævar Arnfjörð Bjarmason
2023-02-09 16:34     ` John Cai
2023-02-11  1:39       ` Jeff King
2023-02-14 21:40 ` [PATCH v2 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-14 21:40   ` [PATCH v2 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-15  2:38     ` Junio C Hamano
2023-02-15 23:34       ` John Cai
2023-02-15 23:42         ` Junio C Hamano
2023-02-16  2:14           ` Jeff King
2023-02-16  2:57             ` Junio C Hamano
2023-02-16 20:34               ` John Cai
2023-02-14 21:40   ` [PATCH v2 2/2] diff: teach diff to read gitattribute diff-algorithm John Cai via GitGitGadget
2023-02-15  2:56     ` Junio C Hamano
2023-02-15  3:20       ` Junio C Hamano
2023-02-16 20:37         ` John Cai
2023-02-17 20:21   ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes John Cai via GitGitGadget
2023-02-17 20:21     ` [PATCH v3 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-17 21:27       ` Junio C Hamano
2023-02-18  1:36       ` Elijah Newren
2023-02-17 20:21     ` [PATCH v3 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-17 21:50       ` Junio C Hamano
2023-02-18  2:56       ` Elijah Newren
2023-02-20 15:32         ` John Cai
2023-02-20 16:21           ` Elijah Newren
2023-02-20 16:49             ` John Cai
2023-02-20 17:32               ` Elijah Newren
2023-02-20 20:53                 ` John Cai
2023-02-22 19:47                 ` Jeff King
2023-02-24 17:44                   ` John Cai
2023-02-18  1:16     ` [PATCH v3 0/2] Teach diff to honor diff algorithms set through git attributes Elijah Newren
2023-02-20 13:37       ` John Cai
2023-02-20 21:04     ` [PATCH v4 " John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 1/2] diff: consolidate diff algorithm option parsing John Cai via GitGitGadget
2023-02-20 21:04       ` [PATCH v4 2/2] diff: teach diff to read algorithm from diff driver John Cai via GitGitGadget
2023-02-21 17:34       ` [PATCH v4 0/2] Teach diff to honor diff algorithms set through git attributes Junio C Hamano
2023-02-21 18:05         ` Elijah Newren
2023-02-21 18:51           ` Junio C Hamano
2023-02-21 19:36             ` John Cai
2023-02-21 20:16               ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=65129323-326F-4E4A-B6F8-06DC3BBE7B58@gmail.com \
    --to=johncai86@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=phillip.wood@dunelm.org.uk \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).