git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/2] Teach log -G to ignore binary files
@ 2018-11-21 20:52 Thomas Braun
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Braun @ 2018-11-21 20:52 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

Based on the previous discussion in [1] I've prepared patches which teach 
log -G to ignore binary files. log -S keeps its behaviour but got a test to ensure that.

Feedback welcome!

[1]: https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3e5c@virtuell-zuhause.de/

Thomas Braun (2):
  log -G: Ignore binary files
  log -S: Add test which searches in binary files

 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c            |  5 +++++
 t/t4209-log-pickaxe.sh        | 21 +++++++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

-- 
2.19.0.271.gfe8321ec05.dirty



* [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-21 20:52 [PATCH 0/2] Teach log -G to ignore binary files Thomas Braun
@ 2018-11-21 20:52 ` Thomas Braun
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
                     ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-21 20:52 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Thomas Braun

The -G <regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

The concept of differences only makes sense for text files, therefore
we need to ignore binary files when searching with -G <regex> as well.

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c            |  5 +++++
 t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..8c2558b07d 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
 		textconv_two = get_textconv(o->repo->index, p->two);
 	}
 
+	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
+		return 0;
+
 	/*
 	 * If we have an unmodified pair, we know that the count will be the
 	 * same and don't even have to load the blobs. Unless textconv is in
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..42cc8afd8b 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+	rm -rf .git &&
+	git init &&
+	printf "a\0b" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -G a >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	rm -rf .git &&
+	git init &&
+	echo "* diff=bin" > .gitattributes &&
+	printf "a\0b" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git -c diff.bin.textconv=cat log -G a >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



* [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
@ 2018-11-21 20:52   ` Thomas Braun
  2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
                       ` (2 more replies)
  2018-11-22  1:29   ` [PATCH v1 1/2] log -G: Ignore " Junio C Hamano
                     ` (3 subsequent siblings)
  4 siblings, 3 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-21 20:52 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Thomas Braun

The -S <regex> option of log looks for differences that change the
number of occurrences of the specified string (i.e. addition/deletion)
in a file.

Add a test to ensure that we keep looking into binary files with -S
as changing that would break backwards compatibility in unexpected ways.

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---
 t/t4209-log-pickaxe.sh | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 42cc8afd8b..d430f6f2f9 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' '
 	test_cmp actual expected
 '
 
+test_expect_success 'log -S looks into binary files' '
+	rm -rf .git &&
+	git init &&
+	printf "a\0b" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -S a >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



* [PATCH 0/2] Teach log -G to ignore binary files
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
@ 2018-11-21 21:00     ` Thomas Braun
  2018-11-28 11:32       ` [PATCH v2] log -G: Ignore " Thomas Braun
  2018-12-14 18:49       ` [PATCH v3] log -G: ignore " Thomas Braun
  2018-11-22  1:34     ` [PATCH v1 2/2] log -S: Add test which searches in " Junio C Hamano
  2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-21 21:00 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

Based on the previous discussion in [1] I've prepared patches which teach
log -G to ignore binary files. log -S keeps its behaviour but got a test to ensure that.

Feedback welcome!

[1]: https://public-inbox.org/git/7a0992eb-adb9-a7a1-cfaa-3384bc4d3e5c@virtuell-zuhause.de/

PS: This is the (possibly missing) cover letter.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
@ 2018-11-22  1:29   ` Junio C Hamano
  2018-11-28 11:31     ` Thomas Braun
  2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-11-22  1:29 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, peff

Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.
>
> Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c            |  5 +++++
>  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)

OK.

> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

OK.

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>  
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	/*
>  	 * If we have an unmodified pair, we know that the count will be the
>  	 * same and don't even have to load the blobs. Unless textconv is in

Shouldn't this new test come after the existing optimization, which
allows us to leave without loading the blob contents (which is
needed once you call diff_filespec_is_binary())?

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> +	rm -rf .git &&
> +	git init &&

Please never never ever do the above two unless you are writing a
test that checks low-level repository details.

If you want a clean history that has specific lineage of commits
without getting affected by commits that have been made by the
previous test pieces, it is OK to "checkout --orphan" to create an
empty history to work with.
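
A minimal sketch of that approach, written as plain shell against a scratch repository rather than inside the test harness. The branch name orphan1, the file old.txt, and the identity variables are placeholders for illustration, not part of the patch:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Throwaway identity so commits work in a clean environment (assumption).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

# Stand-in for history left behind by earlier test pieces.
echo old >old.txt
git add old.txt
git commit -q -m "earlier test"

# Start an unrelated, empty history instead of "rm -rf .git && git init".
git checkout -q --orphan orphan1
# The index and worktree are carried over by --orphan; clear them.
git rm -rf -q .

printf "a\0b" >data.bin
git add data.bin
git commit -q -m "message"

# Count the commits reachable from the orphan branch.
count=$(git rev-list --count HEAD)
echo "commits on orphan1: $count"
```

The repository configuration and object store stay intact, yet "git log" on the new branch does not see commits made by earlier test pieces.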

> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -G a >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	rm -rf .git &&
> +	git init &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -G a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done


* Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
  2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
@ 2018-11-22  1:34     ` Junio C Hamano
  2018-11-28 11:31       ` Thomas Braun
  2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-11-22  1:34 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, peff

Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> The -S <regex> option of log looks for differences that change the
> number of occurrences of the specified string (i.e. addition/deletion)
> in a file.

s/-S <regex>/-S<block of text>/ and
s/the specified string/the specified block of text/ would make it
more in line with how Documentation/gitdiffcore.txt explains it.
The original discussion from early 2017 also explains with a pointer
why the primary mode of -S is not <regex> but is <block of text>.
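
A small sketch of that distinction, using a scratch repository with a placeholder identity and file name (all assumptions for illustration): without --pickaxe-regex the argument to -S is matched as a literal string, while -G always takes a regular expression.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Throwaway identity so the commit works in a clean environment (assumption).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

echo "foo" >file
git add file
git commit -q -m "add foo"

# -S counts occurrences of the literal string "f.o"; "foo" contains none.
literal=$(git log --format=%s -S'f.o')
# -G greps the added/removed lines with the regex f.o, which matches "foo".
regex=$(git log --format=%s -G'f.o')
# --pickaxe-regex makes -S interpret its argument as a regex as well.
as_regex=$(git log --format=%s --pickaxe-regex -S'f.o')

echo "-S literal:        '$literal'"
echo "-G regex:          '$regex'"
echo "-S pickaxe-regex:  '$as_regex'"
```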

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 42cc8afd8b..d430f6f2f9 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' '
>  	test_cmp actual expected
>  '
>  
> +test_expect_success 'log -S looks into binary files' '
> +	rm -rf .git &&
> +	git init &&

Same comment as the one for 1/2 applies here.

> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -S a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

Other than these, I think both patches look sensible.  Thanks for
resurrecting the old topic and reigniting it.


* Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
  2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
  2018-11-22  1:34     ` [PATCH v1 2/2] log -S: Add test which searches in " Junio C Hamano
@ 2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
  2018-11-24  2:27       ` Junio C Hamano
  2018-11-28 11:31       ` Thomas Braun
  2 siblings, 2 replies; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-22  9:14 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, gitster, peff


On Wed, Nov 21 2018, Thomas Braun wrote:

> The -S <regex> option of log looks for differences that change the
> number of occurrences of the specified string (i.e. addition/deletion)
> in a file.
>
> Add a test to ensure that we keep looking into binary files with -S
> as changing that would break backwards compatibility in unexpected ways.
>
> Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> ---
>  t/t4209-log-pickaxe.sh | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 42cc8afd8b..d430f6f2f9 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' '
>  	test_cmp actual expected
>  '
>
> +test_expect_success 'log -S looks into binary files' '
> +	rm -rf .git &&
> +	git init &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -S a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

This should just be part of 1/2 since the behavior is changed there &
the commit message should describe both cases.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
  2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
  2018-11-22  1:29   ` [PATCH v1 1/2] log -G: Ignore " Junio C Hamano
@ 2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
  2018-11-22 16:27     ` Jeff King
                       ` (2 more replies)
  2018-11-22 16:20   ` Jeff King
  2018-11-26 20:19   ` Stefan Beller
  4 siblings, 3 replies; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-22 10:16 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, gitster, peff

On Wed, Nov 21 2018, Thomas Braun wrote:

> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.
>
> Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c            |  5 +++++
>  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.
>
>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	/*
>  	 * If we have an unmodified pair, we know that the count will be the
>  	 * same and don't even have to load the blobs. Unless textconv is in
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> +	rm -rf .git &&
> +	git init &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -G a >result &&

Would be less confusing as "-Ga" since that's the invocation we
document, even though I see (but wasn't aware that...) "-G a" works too.

> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	rm -rf .git &&
> +	git init &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -G a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

This patch seems like the wrong direction to me. In particular the
assertion that "the concept of differences only makes sense for text
files". That's just not true. This patch breaks this:

    (
        rm -rf /tmp/g-test &&
        git init /tmp/g-test &&
        cd /tmp/g-test &&
        for i in {1..10}; do
            echo "Always matching thensome 5" >file &&
            printf "a thensome %d binary \0" $i >>file &&
            git add file &&
            git commit -m"Bump $i"
        done &&
        git log -Gthensome.*5
    )

Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
[156]". The 1st one because it introduces the "Always matching thensome
5". Then 5/6 because they add/remove the string "a thensome 5 binary",
respectively, which matches /thensome.*5/.

I.e. in the first one we do a regex match against the content here
because we don't have both sides:
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53

And then for the later ones where we have both sides we end up in
diffgrep_consume():
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36

I think there may be a real issue here to address, which might be some
combination of:

 a) Even though the diffcore can do a binary diff internally, this is
    not what it exposes with "-p", we just say "Binary files differ".

    I don't know how to emit the raw version we'll end up passing to
    diffgrep_consume() in this case. Is it just --binary without the
    encoding? I don't know...

 b) Your test case shows that you're matching a string at a \0
    boundary. Is this perhaps something you ran into? I.e. that we don't
    have some -F version of -G so we can't supply regexes that match
    past a \0? I had some related work on grep for this that hasn't been
    carried over to the diffcore:

        git log --grep='grep:.*\\0' --author=Ævar

 c) Is this binary diff we end up matching against just bad in some
    cases? I haven't dug but that wouldn't surprise me, i.e. that it's
    trying to be line-based so we'll overmatch in many cases.

So maybe this is something that should be passed down as a flag? See a
recent discussion at
https://public-inbox.org/git/87lg77cmr1.fsf@evledraar.gmail.com/ for how
that could be done.

Also if we don't have some tests already that were failing with this
patch we really should have those as "let's test the current behavior
first". Unfortunately tests in this area are really lacking, see
e.g. my:

    git log --author=Junio --min-parents=2 --grep=ab/.*grep

For some series of patches to grep where to get one patch in I needed to
often lead with 5-10 test patches to convince reviewers that I knew what
I was changing, and also to be comfortable that I'd covered all the edge
cases we currently supported, but weren't testing for.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
                     ` (2 preceding siblings ...)
  2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
@ 2018-11-22 16:20   ` Jeff King
  2018-11-24  2:32     ` Junio C Hamano
  2018-11-28 11:31     ` Thomas Braun
  2018-11-26 20:19   ` Stefan Beller
  4 siblings, 2 replies; 30+ messages in thread
From: Jeff King @ 2018-11-22 16:20 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, gitster

On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>  
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;

If the user passes "-a" to treat binary files as text, we should
probably skip the binary check. I think we'd need to check
"o->flags.text" here.
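
A rough sketch of what that would look like from the command line, assuming the binary check is skipped whenever --text is given (the behavior suggested here, not what the v1 patch implements). The scratch repository and identity variables are placeholders:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Throwaway identity so the commit works in a clean environment (assumption).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

# The NUL byte makes git classify the blob as binary.
printf "a\0b" >data.bin
git add data.bin
git commit -q -m "message"

# Plain -G: expected to print nothing once -G skips binary files.
without_text=$(git log --format=%s -Ga)
# With -a (--text) the blob is treated as text, so the commit should match.
with_text=$(git log --format=%s -a -Ga)

echo "without -a: '$without_text'"
echo "with -a:    '$with_text'"
```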

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> [...]
> +test_expect_success 'log -G ignores binary files' '
> [...]
> +test_expect_success 'log -G looks into binary files with textconv filter' '

And likewise add a test here similar to the textconv one.

-Peff


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
@ 2018-11-22 16:27     ` Jeff King
  2018-11-28 11:31     ` Thomas Braun
  2018-11-28 11:31     ` Thomas Braun
  2 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2018-11-22 16:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Thomas Braun, git, gitster

On Thu, Nov 22, 2018 at 11:16:38AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:

But "-G" is defined as "look for differences whose patch text contains
added/removed lines that match <regex>". We don't have patch text here,
let alone added/removed lines.

For binary files, "-Sfoo" is better defined. I think we _could_ define
"search for <pattern> in the added/removed bytes of a binary file".  But
I don't think that's what the current code does (it really does a line
diff on a binary file, which is likely to put tons of unchanged crap
into the "added and removed" lines, because the line divisions aren't
meaningful in the first place).

>     (
>         rm -rf /tmp/g-test &&
>         git init /tmp/g-test &&
>         cd /tmp/g-test &&
>         for i in {1..10}; do
>             echo "Always matching thensome 5" >file &&
>             printf "a thensome %d binary \0" $i >>file &&
>             git add file &&
>             git commit -m"Bump $i"
>         done &&
>         git log -Gthensome.*5
>     )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because they add/remove the string "a thensome 5 binary",
> respectively, which matches /thensome.*5/.

Right, this will sometimes do the right thing. But it will also often do
the wrong thing. It's also very expensive (we specifically avoid feeding
large binary files to xdiff, but I think "-G" will happily do so -- I
didn't double check, though).

-Peff


* Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
@ 2018-11-24  2:27       ` Junio C Hamano
  2018-11-28 11:31       ` Thomas Braun
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-11-24  2:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Thomas Braun, git, peff

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Wed, Nov 21 2018, Thomas Braun wrote:
>
>> The -S <regex> option of log looks for differences that change the
>> number of occurrences of the specified string (i.e. addition/deletion)
>> in a file.
>>
> ...
> This should just be part of 1/2 since the behavior is changed there &
> the commit message should describe both cases.

Sensible suggestion.  Thanks.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-22 16:20   ` Jeff King
@ 2018-11-24  2:32     ` Junio C Hamano
  2018-11-28 11:31     ` Thomas Braun
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-11-24  2:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Braun, git

Jeff King <peff@peff.net> writes:

>> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
>> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
>> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
>> +		return 0;
>
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Yeah, I forgot about that option.  It would give an escape hatch
that has a sane explanation.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
                     ` (3 preceding siblings ...)
  2018-11-22 16:20   ` Jeff King
@ 2018-11-26 20:19   ` Stefan Beller
  2018-11-27  0:51     ` Junio C Hamano
  4 siblings, 1 reply; 30+ messages in thread
From: Stefan Beller @ 2018-11-26 20:19 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, Junio C Hamano, Jeff King

On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
<thomas.braun@virtuell-zuhause.de> wrote:
>
> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.

What about partial text/partial binary files?

I recall using text searching tools (not necessarily git machinery,
my memory is fuzzy) to check for strings in pdf files, which are
usually marked binary in context of git, such that we do not
see their diffs in `log -p`.

But I would expect a search with -G or -S to still work...
until I find the exception in the docs, only to wonder if
there is a switch to turn off this optimisation for this
corner case.

Stefan


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-26 20:19   ` Stefan Beller
@ 2018-11-27  0:51     ` Junio C Hamano
  2018-11-28 11:31       ` Thomas Braun
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-11-27  0:51 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Thomas Braun, git, Jeff King

Stefan Beller <sbeller@google.com> writes:

> On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
> <thomas.braun@virtuell-zuhause.de> wrote:
>>
>> The -G <regex> option of log looks for the differences whose patch text
>> contains added/removed lines that match regex.
>>
>> The concept of differences only makes sense for text files, therefore
>> we need to ignore binary files when searching with -G <regex> as well.
>
> What about partial text/partial binary files?

Good point. You'd use "-a" (or "--text") to tell the diff machinery
to treat the contents as text, and the new logic must pay attention
to that command line option.


* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-22  1:29   ` [PATCH v1 1/2] log -G: Ignore " Junio C Hamano
@ 2018-11-28 11:31     ` Thomas Braun
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

> Junio C Hamano <gitster@pobox.com> wrote on 22 November 2018 at 02:29:
> 
> 
> Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:
> 
> > The -G <regex> option of log looks for the differences whose patch text
> > contains added/removed lines that match regex.
> >
> > The concept of differences only makes sense for text files, therefore
> > we need to ignore binary files when searching with -G <regex> as well.
> >
> > Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> > ---
> >  Documentation/gitdiffcore.txt |  2 +-
> >  diffcore-pickaxe.c            |  5 +++++
> >  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> OK.
> 
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c0a60f3158..059ddd3431 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
> >  regular expression.  This means that it will detect in-file (or what
> >  rename-detection considers the same file) moves, which is noise.  The
> >  implementation runs diff twice and greps, and this can be quite
> > -expensive.
> > +expensive.  Binary files without textconv filter are ignored.
> 
> OK.
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  		textconv_two = get_textconv(o->repo->index, p->two);
> >  	}
> >  
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> > +
> >  	/*
> >  	 * If we have an unmodified pair, we know that the count will be the
> >  	 * same and don't even have to load the blobs. Unless textconv is in
> 
> Shouldn't this new test come after the existing optimization, which
> allows us to leave without loading the blob contents (which is
> needed once you call diff_filespec_is_binary())?

Yes, good point.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >  
> > +test_expect_success 'log -G ignores binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> 
> Please never never ever do the above two unless you are writing a
> test that checks low-level repository details.
> 
> If you want a clean history that has specific lineage of commits
> without getting affected by commits that have been made by the
> previous test pieces, it is OK to "checkout --orphan" to create an
> empty history to work with.

Thanks for the hint. I thought I had seen a less intrusive way for getting an empty history. 
Changed.

> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -G a >result &&
> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-22  1:34     ` [PATCH v1 2/2] log -S: Add test which searches in " Junio C Hamano
@ 2018-11-28 11:31       ` Thomas Braun
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

> Junio C Hamano <gitster@pobox.com> wrote on 22 November 2018 at 02:34:
> 
> 
> Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:
> 
> > The -S <regex> option of log looks for differences that change the
> > number of occurrences of the specified string (i.e. addition/deletion)
> > in a file.
> 
> s/-S <regex>/-S<block of text>/ and
> s/the specified string/the specified block of text/ would make it
> more in line with how Documentation/gitdiffcore.txt explains it.
> The original discussion from early 2017 also explains with a pointer
> why the primary mode of -S is not <regex> but is <block of text>.

Thanks for the pointer. I've updated the commit message.
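
To make the "block of text" semantics concrete, here is a minimal sketch in C
of what -S checks: whether the number of occurrences of the block differs
between the two sides of a filepair. This is an illustration only, not the
code in diffcore-pickaxe.c; count_occurrences() and pickaxe_s_hit() are
hypothetical names, and counting non-overlapping matches is an assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count non-overlapping occurrences of needle in buf. The buffer is
 * length-delimited, so embedded NUL bytes (binary data) are fine.
 */
size_t count_occurrences(const char *buf, size_t len,
                         const char *needle, size_t nlen)
{
    size_t count = 0, i = 0;

    if (nlen == 0 || nlen > len)
        return 0;
    while (i + nlen <= len) {
        if (memcmp(buf + i, needle, nlen) == 0) {
            count++;
            i += nlen;      /* continue after the match */
        } else {
            i++;
        }
    }
    return count;
}

/* -S style hit: the count differs between preimage and postimage. */
int pickaxe_s_hit(const char *old_buf, size_t old_len,
                  const char *new_buf, size_t new_len,
                  const char *needle)
{
    size_t nlen = strlen(needle);

    return count_occurrences(old_buf, old_len, needle, nlen) !=
           count_occurrences(new_buf, new_len, needle, nlen);
}
```

No diff is generated and no line structure is needed here, which is one reason
keeping -S working on binary files is unproblematic.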
 
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 42cc8afd8b..d430f6f2f9 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' '
> >  	test_cmp actual expected
> >  '
> >  
> > +test_expect_success 'log -S looks into binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> 
> Same comment as the one for 1/2 applies here.

Fixed as well.

> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -S a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> Other than these, I think both patches look sensible.  Thanks for
> resurrecting the old topic and reigniting it.
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 2/2] log -S: Add test which searches in binary files
  2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
  2018-11-24  2:27       ` Junio C Hamano
@ 2018-11-28 11:31       ` Thomas Braun
  1 sibling, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, gitster, peff

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 22 November 2018 at 10:14:
> 
> 
> 
> On Wed, Nov 21 2018, Thomas Braun wrote:
> 
> > The -S <regex> option of log looks for differences that change the
> > number of occurrences of the specified string (i.e. addition/deletion)
> > in a file.
> >
> > Add a test to ensure that we keep looking into binary files with -S
> > as changing that would break backwards compatibility in unexpected ways.
> >
> > Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> > ---
> >  t/t4209-log-pickaxe.sh | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 42cc8afd8b..d430f6f2f9 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -128,4 +128,15 @@ test_expect_success 'log -G looks into binary files with textconv filter' '
> >  	test_cmp actual expected
> >  '
> >
> > +test_expect_success 'log -S looks into binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -S a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This should just be part of 1/2 since the behavior is changed there &
> the commit message should describe both cases.

My reasoning was that this is a separate test which does not fit in with the other part.
But I'm happy to fold both into one patch. Done.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
  2018-11-22 16:27     ` Jeff King
@ 2018-11-28 11:31     ` Thomas Braun
  2018-11-28 11:31     ` Thomas Braun
  2 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, gitster, peff

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 22 November 2018 at 11:16:

[...]

> >
> > +test_expect_success 'log -G ignores binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -G a >result &&
> 
> Would be less confusing as "-Ga" since that's the invocation we
> document, even though I see (but wasn't aware that...) "-G a" works too.

Done.

> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:
> 
>     (
>         rm -rf /tmp/g-test &&
>         git init /tmp/g-test &&
>         cd /tmp/g-test &&
>         for i in {1..10}; do
>             echo "Always matching thensome 5" >file &&
>             printf "a thensome %d binary \0" $i >>file &&
>             git add file &&
>             git commit -m"Bump $i"
>         done &&
>         git log -Gthensome.*5
>     )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because they add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

log -p does not show you the patch text in your example because the file is
treated as binary. Currently "log -G" has a different opinion about what it
looks into and what it ignores. My patch tries to bring both more in line.
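
For reference, the line-based matching that -G performs on a textual diff can
be sketched roughly as follows. This is a simplified illustration in C, not
the code in diffcore-pickaxe.c: diff_lines_match() is a hypothetical helper,
it expects only hunk body lines (no "---"/"+++" headers), it uses a POSIX
basic regex (an assumption about the default), and it truncates very long
lines.

```c
#include <regex.h>
#include <string.h>

/*
 * Return 1 if any added ('+') or removed ('-') line in the hunk body
 * matches the basic regex, 0 if none does, -1 if the pattern is invalid.
 * Context lines (leading ' ') are never inspected.
 */
int diff_lines_match(const char *patch, const char *pattern)
{
    regex_t re;
    int hit = 0;
    const char *line = patch;

    if (regcomp(&re, pattern, REG_NOSUB))
        return -1;

    while (*line && !hit) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len > 0 && (line[0] == '+' || line[0] == '-')) {
            char buf[1024];
            /* copy the line without its +/- marker, truncating if needed */
            size_t n = len - 1 < sizeof(buf) - 1 ? len - 1 : sizeof(buf) - 1;

            memcpy(buf, line + 1, n);
            buf[n] = '\0';
            hit = regexec(&re, buf, 0, NULL, 0) == 0;
        }
        line = eol ? eol + 1 : line + len;
    }
    regfree(&re);
    return hit;
}
```

Because only '+' and '-' lines are considered, an unchanged line such as
"Always matching thensome 5" does not produce a hit on its own.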
 
> I.e. in the first one we do a regex match against the content here
> because we don't have both sides:
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53
> 
> And then for the later ones where we have both sides we end up in
> diffgrep_consume():
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
> 
> I think there may be a real issue here to address, which might be some
> combination of:
> 
>  a) Even though the diffcore can do a binary diff internally, this is
>     not what it exposes with "-p", we just say "Binary files differ".
> 
>     I don't know how to emit the raw version we'll end up passing to
>     diffgrep_consume() in this case. Is it just --binary without the
>     encoding? I don't know...
> 
>  b) Your test case shows that you're matching a string at a \0
>     boundary. Is this perhaps something you ran into? I.e. that we don't
>     have some -F version of -G so we can't supply regexes that match
>     past a \0? I had some related work on grep for this that hasn't been
>     carried over to the diffcore:
> 
>         git log --grep='grep:.*\\0' --author=Ævar
> 
>  c) Is this binary diff we end up matching against just bad in some
>     cases? I haven't dug but that wouldn't surprise me, i.e. that it's
>     trying to be line-based so we'll overmatch in many cases.
> 
> So maybe this is something that should be passed down as a flag? See a
> recent discussion at
> https://public-inbox.org/git/87lg77cmr1.fsf@evledraar.gmail.com/ for how
> that could be done.

It is not about the \0 boundary. v2 of the patches will clarify that. My main
motivation is to speed up "log -G", as it takes a considerable amount of time
when it wades through MBs of binary files which change often. In multiple places
I can already treat binary files differently (e.g. turn off delta compression, skip
trying to diff them, no EOL normalization). For me, making log -G ignore what git
thinks are binary files makes the line clearer between what should be treated as
binary and what as text.
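
As a rough illustration of "what git thinks are binary files" when no
attribute or diff driver says otherwise: to my understanding the fallback is a
content heuristic that looks for a NUL byte near the start of the blob. The
sketch below is written from memory and is an assumption, not a quote of Git's
source; looks_binary() is a hypothetical name and the 8000-byte window is
assumed.

```c
#include <stddef.h>
#include <string.h>

#define FIRST_FEW_BYTES 8000    /* assumed size of the inspected window */

/*
 * Heuristic: treat the buffer as binary if a NUL byte appears within
 * the first FIRST_FEW_BYTES bytes.
 */
int looks_binary(const char *buf, size_t size)
{
    if (size > FIRST_FEW_BYTES)
        size = FIRST_FEW_BYTES;
    return memchr(buf, 0, size) != NULL;
}
```

The test data "a\0a" used in this series would be classified as binary by such
a check, while an svg file would not unless it is marked binary via attributes.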

> Also if we don't have some tests already that were failing with this
> patch we really should have those as "let's test the current behavior
> first". Unfortunately tests in this area are really lacking, see
> e.g. my:
> 
>     git log --author=Junio --min-parents=2 --grep=ab/.*grep
> 
> For some series of patches to grep where to get one patch in I needed to
> often lead with 5-10 test patches to convince reviewers that I knew what
> I was changing, and also to be comfortable that I'd covered all the edge
> cases we currently supported, but weren't testing for.

I'm happy to add more test cases to convince everyone involved :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-22 16:20   ` Jeff King
  2018-11-24  2:32     ` Junio C Hamano
@ 2018-11-28 11:31     ` Thomas Braun
  1 sibling, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster

> Jeff King <peff@peff.net> wrote on 22 November 2018 at 17:20:
> 
> 
> On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  		textconv_two = get_textconv(o->repo->index, p->two);
> >  	}
> >  
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> 
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Good point. I missed that flag. Added.
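
The resulting condition, with the "-a" check added, can be modelled as a small
pure function. This is a sketch in C for illustration only: struct side and
skip_pair_for_G() are hypothetical stand-ins for the diff_filespec and
diff_options fields used in the real patch.

```c
#include <stdbool.h>

/* One side of a filepair, reduced to the two facts the check needs. */
struct side {
    bool has_textconv;  /* a textconv filter applies to this side */
    bool is_binary;     /* the content is considered binary */
};

/*
 * Return true if -G should skip this filepair: only for -G, only when
 * -a/--text was not given, and only if at least one side is binary
 * without a textconv filter to turn it into text.
 */
bool skip_pair_for_G(bool kind_is_G, bool text_flag,
                     struct side one, struct side two)
{
    if (!kind_is_G || text_flag)
        return false;
    return (!one.has_textconv && one.is_binary) ||
           (!two.has_textconv && two.is_binary);
}
```

Note that -S never takes this early exit, which matches the intent to keep its
behaviour unchanged.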

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> > [...]
> > +test_expect_success 'log -G ignores binary files' '
> > [...]
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> 
> And likewise add a test here similar to the textconv one.

Added as well.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 1/2] log -G: Ignore binary files
  2018-11-27  0:51     ` Junio C Hamano
@ 2018-11-28 11:31       ` Thomas Braun
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:31 UTC (permalink / raw)
  To: Junio C Hamano, Stefan Beller; +Cc: git, Jeff King


> Junio C Hamano <gitster@pobox.com> wrote on 27 November 2018 at 01:51:
> 
> 
> Stefan Beller <sbeller@google.com> writes:
> 
> > On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
> > <thomas.braun@virtuell-zuhause.de> wrote:
> >>
> >> The -G <regex> option of log looks for the differences whose patch text
> >> contains added/removed lines that match regex.
> >>
> >> The concept of differences only makes sense for text files, therefore
> >> we need to ignore binary files when searching with -G <regex> as well.
> >
> > What about partial text/partial binary files?
> 
> Good point. You'd use "-a" (or "--text") to tell the diff machinery
> to treat the contents as text, and the new logic must pay attention
> to that command line option.

Yes exactly. Either use -a for the occasional use or a textconv filter
for permanent use.

Coming from the opposite side: I usually mark svg files as binary, as the
textual diff is, well, let's say uninspiring.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2] log -G: Ignore binary files
  2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
@ 2018-11-28 11:32       ` Thomas Braun
  2018-11-28 12:54         ` Ævar Arnfjörð Bjarmason
  2018-11-29  7:10         ` Junio C Hamano
  2018-12-14 18:49       ` [PATCH v3] log -G: ignore " Thomas Braun
  1 sibling, 2 replies; 30+ messages in thread
From: Thomas Braun @ 2018-11-28 11:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sbeller, avarab, Thomas Braun

The -G<regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

As the concept of patch text only makes sense for text files, we need to
ignore binary files when searching with -G <regex> as well.

The -S<block of text> option of log looks for differences that change
the number of occurrences of the specified block of text (i.e.
addition/deletion) in a file. As we want to keep the current behaviour,
add a test to ensure it.

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---

Changes since v1:
- Merged both patches into one
- Adapted commit messages
- Added missing support for -a flag with tests
- Placed new code into correct location to be able to reuse an existing
  optimization
- Uses the documented -Ga spelling without a space
- Uses orphan branches instead of cannonball cleanup with rm -rf
- Changed search text to make it clear that it is not about the \0 boundary

 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c            |  6 ++++++
 t/t4209-log-pickaxe.sh        | 40 +++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..4cea086f80 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
 	if (textconv_one == textconv_two && diff_unmodified_pair(p))
 		return 0;
 
+	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+	    !o->flags.text &&
+	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
+		return 0;
+
 	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
 	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
 
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..5c3e2a16b2 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+	git checkout --orphan orphan1 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -Ga >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git checkout --orphan orphan2 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -a -Ga >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	git checkout --orphan orphan3 &&
+	echo "* diff=bin" > .gitattributes &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git -c diff.bin.textconv=cat log -Ga >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
+test_expect_success 'log -S looks into binary files' '
+	git checkout --orphan orphan4 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -Sa >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-28 11:32       ` [PATCH v2] log -G: Ignore " Thomas Braun
@ 2018-11-28 12:54         ` Ævar Arnfjörð Bjarmason
  2018-12-14 18:44           ` Thomas Braun
  2018-11-29  7:10         ` Junio C Hamano
  1 sibling, 1 reply; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-28 12:54 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, gitster, peff, sbeller


On Wed, Nov 28 2018, Thomas Braun wrote:

Looks much better this time around.

> The -G<regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> As the concept of patch text only makes sense for text files, we need to
> ignore binary files when searching with -G <regex> as well.
>
> The -S<block of text> option of log looks for differences that change
> the number of occurrences of the specified block of text (i.e.
> addition/deletion) in a file. As we want to keep the current behaviour,
> add a test to ensure it.
> [...]
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

Now that we support --text that should be documented. I tried to come up
with something on top:

    diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
    index 0378cd574e..42ae65fb57 100644
    --- a/Documentation/diff-options.txt
    +++ b/Documentation/diff-options.txt
    @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first
     came into being: use the feature iteratively to feed the interesting
     block in the preimage back into `-S`, and keep going until you get the
     very first version of the block.
    ++
    +Unlike `-G` the `-S` option will always search through binary files
    +without a textconv filter. [[TODO: Don't we want to support --no-text
    +then as an optimization?]].

     -G<regex>::
     	Look for differences whose patch text contains added/removed
    @@ -545,6 +549,15 @@ occurrences of that string did not change).
     +
     See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
     information.
    ++
    +Unless `--text` is supplied binary files without a textconv filter
    +will be ignored.  This was not the case before Git version 2.21.
    ++
    +With `--text`, instead of patch lines we <some example similar to the
    +above diff showing what we actually do for binary files. [[TODO: How
    +does that work? Could just link to the "diffcore-pickaxe: For
    +Detecting Addition/Deletion of Specified String" section in
    +gitdiffcore(7) which could explain it]]

     --find-object=<object-id>::
     	Look for differences that change the number of occurrences of
    diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
    index c0a60f3158..26880b4149 100644
    --- a/Documentation/gitdiffcore.txt
    +++ b/Documentation/gitdiffcore.txt
    @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept.  This behavior
     is designed to make reviewing changes in the context of the whole
     changeset easier.

    +Both `-S` and `-G` will ignore binary files without a textconv filter
    +by default; this can be overridden with `--text`. With `--text` the
    +binary patch we look through is generated as [[TODO: ???]].
    +
     diffcore-order: For Sorting the Output Based on Filenames
     ---------------------------------------------------------

But as you can see given the TODO comments I don't know how this works
exactly. I *could* dig, but that's my main outstanding problem with this
patch, the commit message / docs aren't being updated to reflect the new
behavior.

I.e. let's leave the docs in a state where the reader knows just as
unambiguously what to expect from -G with these binary diffs we've
been implicitly supporting as with the textual diffs. Ideally with some
examples of how to generate them (re my question about the base85 output
in v1).

Part of that's obviously behavior we've had all along, but it's much
more convincing to say:

    We are changing X which we've done for ages, it works exactly like
    this, and here's a switch to get it back.

Instead of:

    X doesn't make sense, let's turn it off.

Also the diffcore docs already say stuff about how slow/fast things are,
and in a side-thread you said:

    My main motivation is to speed up "log -G" as that takes a
    considerable amount of time when it wades through MBs of binary
    files which change often.

Makes sense, but then let's say something about that in that section of
the docs.

>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..4cea086f80 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  	if (textconv_one == textconv_two && diff_unmodified_pair(p))
>  		return 0;
>
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    !o->flags.text &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
>  	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
>
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> +	git checkout --orphan orphan1 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with -a' '
> +	git checkout --orphan orphan2 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -a -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'

A large part of the question(s) I have above & future readers would
presumably have would be answered by these tests using more realistic
test data. I.e. also with \n in there to see whether -G is also
line-based in this binary case.

> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	git checkout --orphan orphan3 &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	git checkout --orphan orphan4 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Sa >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

These tests have way too much repeated boilerplate for no reason. This
could just be (as-is, without the better test data suggested above):

diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..23ed6cc4b1 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,34 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '

+test_expect_success 'setup log -[GS] binary & --text' '
+	git checkout --orphan GS-binary-and-text &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log >full-log
+'
+
+test_expect_success 'log -G ignores binary files' '
+	git log -Ga >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git log -a -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	echo "* diff=bin" >.gitattributes &&
+	git -c diff.bin.textconv=cat log -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -S looks into binary files' '
+	>.gitattributes &&
+	git log -Sa >actual &&
+	test_cmp actual full-log
+'
+
 test_done

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-28 11:32       ` [PATCH v2] log -G: Ignore " Thomas Braun
  2018-11-28 12:54         ` Ævar Arnfjörð Bjarmason
@ 2018-11-29  7:10         ` Junio C Hamano
  2018-11-29  7:22           ` Junio C Hamano
  2018-12-14 18:45           ` Thomas Braun
  1 sibling, 2 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-11-29  7:10 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, peff, sbeller, avarab

Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> Subject: Re: [PATCH v2] log -G: Ignore binary files

s/Ig/ig/; (will locally munge--this alone is no reason to reroll).

The code changes looked sensible.

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> +	git checkout --orphan orphan1 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'

As this is the first mention of data.bin, this is adding a new file
data.bin that has two 'a' but is a binary file.  And that is the
only commit in the history leading to orphan1.

The fact that "log -Ga" won't find any means it missed the creation
event, because the blob is binary.  Good.

> +test_expect_success 'log -G looks into binary files with -a' '
> +	git checkout --orphan orphan2 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&

This starts from the state left by the previous test piece, i.e. we
have a binary data.bin file with two 'a' in it.  We pretend to
modify and add, but these two steps are no-op if the previous
succeeded, but even if the previous step failed, we get what we want
in the data.bin file.  And then we make an initial commit the same
way.

> +	git log -a -Ga >actual &&
> +	git log >expected &&

And we ran the same test but this time with "-a" to tell Git that
binary-ness should not matter.  It will find the sole commit.  Good.

> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	git checkout --orphan orphan3 &&
> +	echo "* diff=bin" > .gitattributes &&

s/> />/; (will locally munge--this alone is no reason to reroll).

> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&

This exposes a slight iffy-ness in the design.  The textconv filter
used here does not strip the "binary-ness" from the payload, but it
is enough to tell the machinery that -G should look into the
difference.  Is that really desirable, though?

IOW, if this weren't the initial commit (which is handled by the
codepath to special-case creation and deletion in diff_grep()
function), would "log -Ga" show it without "-a"?  Should it?

I think this test piece (and probably the previous ones for "-a" vs
"no -a" without textconv, as well) should be using a history with
three commits, where

    - the root commit introduces "a\0a" to data.bin (creation event)

    - the second commit adds another instance of "a\0a" to data.bin
      (forces comparison)

    - the third commit removes data.bin (deletion event)

and make sure that the three are treated identically.  If "log -Ga"
finds one (with the combination of other conditions like use of
textconv or -a option), it should find all three, and vice versa.
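
A minimal sketch of such a three-commit history, assuming a throwaway repository created with mktemp. The directory, identity, commit messages, and variable names are illustrative only, and the payload includes newlines in addition to the NUL byte:

```shell
#!/bin/sh
# Sketch only: builds a create/modify/delete history for a binary file
# in a scratch repository and counts what each search reports.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Identity is set locally so the sketch does not depend on global config.
git config user.name "A U Thor"
git config user.email "author@example.com"

# Root commit: creation event for a file containing a NUL byte.
printf "a\na\0a\n" >data.bin
git add data.bin
git commit -q -m "create binary file"

# Second commit: append the same bytes again, which forces a comparison.
printf "a\na\0a\n" >>data.bin
git commit -q -m "modify binary file" data.bin

# Third commit: deletion event.
git rm -q data.bin
git commit -q -m "delete binary file"

# Count how many commits each invocation reports.
total=$(git log --format=%s | wc -l | tr -d ' ')
with_text=$(git log -a -Ga --format=%s | wc -l | tr -d ' ')
with_S=$(git log -Sa --format=%s | wc -l | tr -d ' ')
echo "total=$total with_text=$with_text with_S=$with_S"
```

If the three events are treated identically, the counts for "-a -Ga" and "-Sa" should both equal the total number of commits.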

> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	git checkout --orphan orphan4 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Sa >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'

Likewise.  This would also benefit from a three-commit history.

Perhaps you can create such a history at the beginning of these
additions as another "setup -G/-S binary test" step and test
different variations in subsequent tests without the setup?

>  test_done

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-29  7:10         ` Junio C Hamano
@ 2018-11-29  7:22           ` Junio C Hamano
  2018-12-14 18:45             ` Thomas Braun
  2018-12-14 18:45           ` Thomas Braun
  1 sibling, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-11-29  7:22 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, peff, sbeller, avarab

Junio C Hamano <gitster@pobox.com> writes:

>> +test_expect_success 'log -G ignores binary files' '
>> +	git checkout --orphan orphan1 &&
>> +	printf "a\0a" >data.bin &&
>> +	git add data.bin &&
>> +	git commit -m "message" &&
>> +	git log -Ga >result &&
>> +	test_must_be_empty result
>> +'
>
> As this is the first mention of data.bin, this is adding a new file
> data.bin that has two 'a' but is a binary file.  And that is the
> only commit in the history leading to orphan1.
>
> The fact that "log -Ga" won't find any means it missed the creation
> event, because the blob is binary.  Good.

By the way, this root commit records another file whose path is
"file" and has "Picked<LF>" in it.  If the file had 'a' in it, it
would have been included in "git log" output, but that is too subtle
a point to be noticed by the readers who are only reading this patch
without seeing what has been done to the index before this test
piece.

If you are going to restructure these tests to create a three-commit
history in a single expect_success that is inspected with various
"log -Ga" invocations in subsequent tests, it is worth removing that
other file (or rather, starting with "read-tree --empty" immediately
after checking out the orphan branch, to clarify to the readers that
there is nothing but what you add in the set-up step in the index)
to make the test more robust.
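
A small sketch of why the extra step matters, assuming a scratch repository in which a file named "file" stands in for whatever earlier tests left in the index (the repository location and variable names are made up for illustration):

```shell
#!/bin/sh
# Sketch only: shows that "checkout --orphan" keeps the index populated
# and that "read-tree --empty" clears it.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "A U Thor"
git config user.email "author@example.com"

# Stand-in for the state left behind by earlier tests.
echo Picked >file
git add file
git commit -q -m "earlier test state"

git checkout -q --orphan orphan1
# The index still lists "file" at this point.
staged_before=$(git ls-files | wc -l | tr -d ' ')

git read-tree --empty
# Now the index is empty; "file" remains only as an untracked path.
staged_after=$(git ls-files | wc -l | tr -d ' ')
echo "before=$staged_before after=$staged_after"
```

Any commit made right after "checkout --orphan" alone would therefore also record "file", which is the subtlety described above.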


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-28 12:54         ` Ævar Arnfjörð Bjarmason
@ 2018-12-14 18:44           ` Thomas Braun
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-12-14 18:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, gitster, peff, sbeller

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 28 November 2018 at 13:54:
> 
> 
> 
> On Wed, Nov 28 2018, Thomas Braun wrote:
> 
> Looks much better this time around.

Thanks.
 
> > The -G<regex> option of log looks for the differences whose patch text
> > contains added/removed lines that match regex.
> >
> > As the concept of patch text only makes sense for text files, we need to
> > ignore binary files when searching with -G <regex> as well.
> >
> > The -S<block of text> option of log looks for differences that changes
> > the number of occurrences of the specified block of text (i.e.
> > addition/deletion) in a file. As we want to keep the current behaviour,
> > add a test to ensure it.
> > [...]
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c0a60f3158..059ddd3431 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
> >  regular expression.  This means that it will detect in-file (or what
> >  rename-detection considers the same file) moves, which is noise.  The
> >  implementation runs diff twice and greps, and this can be quite
> > -expensive.
> > +expensive.  Binary files without textconv filter are ignored.
> 
> Now that we support --text that should be documented. I tried to come up
> with something on top:
> 
>     diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
>     index 0378cd574e..42ae65fb57 100644
>     --- a/Documentation/diff-options.txt
>     +++ b/Documentation/diff-options.txt
>     @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first
>      came into being: use the feature iteratively to feed the interesting
>      block in the preimage back into `-S`, and keep going until you get the
>      very first version of the block.
>     ++
>     +Unlike `-G` the `-S` option will always search through binary files
>     +without a textconv filter. [[TODO: Don't we want to support --no-text
>     +then as an optimization?]].
> 
>      -G<regex>::
>      	Look for differences whose patch text contains added/removed
>     @@ -545,6 +549,15 @@ occurrences of that string did not change).
>      +
>      See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
>      information.
>     ++
>     +Unless `--text` is supplied binary files without a textconv filter
>     +will be ignored.  This was not the case before Git version 2.21..
>     ++
>     +With `--text`, instead of patch lines we <some example similar to the
>     +above diff showing what we actually do for binary files. [[TODO: How
>     +does that work?. Could just link to the "diffcore-pickaxe: For
>     +Detecting Addition/Deletion of Specified String" section in
>     +gitdiffcore(7) which could explain it]]
> 
>      --find-object=<object-id>::
>      	Look for differences that change the number of occurrences of
>     diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
>     index c0a60f3158..26880b4149 100644
>     --- a/Documentation/gitdiffcore.txt
>     +++ b/Documentation/gitdiffcore.txt
>     @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept.  This behavior
>      is designed to make reviewing changes in the context of the whole
>      changeset easier.
> 
>     +Both `-S' and `-G' will ignore binary files without a textconv filter
>     +by default, this can be overriden with `--text`. With `--text` the
>     +binary patch we look through is generated as [[TODO: ???]].
>     +
>      diffcore-order: For Sorting the Output Based on Filenames
>      ---------------------------------------------------------
> 
> But as you can see given the TODO comments I don't know how this works
> exactly. I *could* dig, but that's my main outstanding problem with this
> patch, the commit message / docs aren't being updated to reflect the new
> behavior.

v3 will have some more documentation, inspired by your sketches here.
I've not included a reference to Git version 2.21, in which this patch will hopefully
land, as mentioning versions does not seem to be common in the documentation.

I consider tweaking the behaviour of -S to be outside the scope of this patch series.
 
> I.e. let's leave the docs in some state where the reader can as
> unambiguously know what to expect with -G and these binary diffs we've
> been implicitly supporting as with the textual diffs. Ideally with some
> examples of how to generate them (re my question about the base85 output
> in v1).
> 
> Part of that's obviously behavior we've had all along, but it's much
> more convincing to say:
> 
>     We are changing X which we've done for ages, it works exactly like
>     this, and here's a switch to get it back.
> 
> Instead of:
> 
>     X doesn't make sense, let's turn it off.
> 
> Also the diffcore docs already say stuff about how slow/fast things are,
> and in a side-thread you said:
> 
>     My main motivation is to speed up "log -G" as that takes a
>     considerable amount of time when it wades through MBs of binary
>     files which change often.
> 
> Makes sense, but then let's say something about that in that section of
> the docs.

Done.

> >  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
> >  that match their respective criterion are kept in the output.  When
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..4cea086f80 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  	if (textconv_one == textconv_two && diff_unmodified_pair(p))
> >  		return 0;
> >
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    !o->flags.text &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> > +
> >  	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
> >  	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
> >
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..5c3e2a16b2 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >
> > +test_expect_success 'log -G ignores binary files' '
> > +	git checkout --orphan orphan1 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Ga >result &&
> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with -a' '
> > +	git checkout --orphan orphan2 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -a -Ga >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> 
> A large part of the question(s) I have above & future readers would
> presumably have would be answered by these tests using more realistic
> test data. I.e. also with \n in there to see whether -G is also
> line-based in this binary case.
> 
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	git checkout --orphan orphan3 &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -Ga >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -S looks into binary files' '
> > +	git checkout --orphan orphan4 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Sa >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done

Done.

> These tests have way too much repeated boilerplate for no reason. This
> could just be (as-is, without the better test data suggested above):
> 
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..23ed6cc4b1 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,34 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
> 
> +test_expect_success 'setup log -[GS] binary & --text' '
> +	git checkout --orphan GS-binary-and-text &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log >full-log
> +'
> +
> +test_expect_success 'log -G ignores binary files' '
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with -a' '
> +	git log -a -Ga >actual &&
> +	test_cmp actual full-log
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	echo "* diff=bin" >.gitattributes &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&
> +	test_cmp actual full-log
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	>.gitattributes &&
> +	git log -Sa >actual &&
> +	test_cmp actual full-log
> +'
> +
>  test_done

Thanks for the pointer. This is resolved in v3 as well. I'm not used to test cases which
depend on each other, but you are totally right.

Thanks for the review.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-29  7:10         ` Junio C Hamano
  2018-11-29  7:22           ` Junio C Hamano
@ 2018-12-14 18:45           ` Thomas Braun
  1 sibling, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-12-14 18:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, avarab

> Junio C Hamano <gitster@pobox.com> wrote on 29 November 2018 at 08:10:
> 
> 
> Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:
> 
> > Subject: Re: [PATCH v2] log -G: Ignore binary files
> 
> s/Ig/ig/; (will locally munge--this alone is no reason to reroll).

Done.
 
> The code changes looked sensible.

Thanks.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..5c3e2a16b2 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >  
> > +test_expect_success 'log -G ignores binary files' '
> > +	git checkout --orphan orphan1 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Ga >result &&
> > +	test_must_be_empty result
> > +'
> 
> As this is the first mention of data.bin, this is adding a new file
> data.bin that has two 'a' but is a binary file.  And that is the
> only commit in the history leading to orphan1.
> 
> The fact that "log -Ga" won't find any means it missed the creation
> event, because the blob is binary.  Good.
> 
> > +test_expect_success 'log -G looks into binary files with -a' '
> > +	git checkout --orphan orphan2 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> 
> This starts from the state left by the previous test piece, i.e. we
> have a binary data.bin file with two 'a' in it.  We pretend to
> modify and add, but these two steps are no-op if the previous
> succeeded, but even if the previous step failed, we get what we want
> in the data.bin file.  And then we make an initial commit the same
> way.
> 
> > +	git log -a -Ga >actual &&
> > +	git log >expected &&
> 
> And we ran the same test but this time with "-a" to tell Git that
> binary-ness should not matter.  It will find the sole commit.  Good.
> 
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	git checkout --orphan orphan3 &&
> > +	echo "* diff=bin" > .gitattributes &&
> 
> s/> />/; (will locally munge--this alone is no reason to reroll).

Done.

> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -Ga >actual &&
> 
> This exposes a slight iffy-ness in the design.  The textconv filter
> used here does not strip the "binary-ness" from the payload, but it
> is enough to tell the machinery that -G should look into the
> difference.  Is that really desirable, though?
> 
> IOW, if this weren't the initial commit (which is handled by the
> codepath to special-case creation and deletion in diff_grep()
> function), would "log -Ga" show it without "-a"?  Should it?

Yes, in v3 "log -Ga" without "-a" and with cat as the textconv filter finds all
three commits (creation, modification, deletion).

I can make that more explicit with a textconv filter which removes the binary-ness:

git -c diff.bin.textconv="sed -e \"s/\x00//g\"" log -Ga >log &&

(diff.bin.textconv="cat -v" works here as well but seems non-portable)

Now we could also search for "aa", as the NUL separating them is gone, but might that
be getting too clever?
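
As a rough sketch of that idea, the following uses a hypothetical helper script, strip-nul.sh, which relies on tr instead of the sed invocation above (a swapped-in alternative, chosen because octal escapes in tr are specified by POSIX). The scratch repository and the "*.bin" attribute pattern are assumptions made for the example:

```shell
#!/bin/sh
# Sketch only: a textconv filter that removes NUL bytes, so that the
# filtered content of data.bin reads "aa".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "A U Thor"
git config user.email "author@example.com"

# Hypothetical helper: git passes the path of the file to convert as "$1".
cat >strip-nul.sh <<\EOF
#!/bin/sh
tr -d '\000' <"$1"
EOF

echo "*.bin diff=bin" >.gitattributes
printf "a\0a" >data.bin
git add data.bin
git commit -q -m "create binary file"

# What the filter produces for the working tree copy.
filtered=$(sh ./strip-nul.sh data.bin)

# Search for "aa" with the filter configured for this invocation only.
matches=$(git -c diff.bin.textconv="sh '$repo/strip-nul.sh'" \
	log -Gaa --format=%s | wc -l | tr -d ' ')
echo "filtered=$filtered matches=$matches"
```

Searching for "aa" only matches because the filter joins the two halves, which makes it visible that -G inspects the converted text, not the raw blob.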

> I think this test piece (and probably the previous ones for "-a" vs
> "no -a" without textconv, as well) should be using a history with
> three commits, where
> 
>     - the root commit introduces "a\0a" to data.bin (creation event)
> 
>     - the second commit adds another instance of "a\0a" to data.bin
>       (forces comparison)
> 
>     - the third commit removes data.bin (deletion event)
> 
> and make sure that the three are treated identically.  If "log -Ga"
> finds one (with the combination of other conditions like use of
> textconv or -a option), it should find all three, and vice versa.

Good point. I've added that.

> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -S looks into binary files' '
> > +	git checkout --orphan orphan4 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Sa >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> 
> Likewise.  This would also benefit from a three-commit history.
> 
> Perhaps you can create such a history at the beginning of these
> additions as another "setup -G/-S binary test" step and test
> different variations in subsequent tests without the setup?

Done.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] log -G: Ignore binary files
  2018-11-29  7:22           ` Junio C Hamano
@ 2018-12-14 18:45             ` Thomas Braun
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Braun @ 2018-12-14 18:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, avarab

> Junio C Hamano <gitster@pobox.com> wrote on 29 November 2018 at 08:22:
> 
> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> +test_expect_success 'log -G ignores binary files' '
> >> +	git checkout --orphan orphan1 &&
> >> +	printf "a\0a" >data.bin &&
> >> +	git add data.bin &&
> >> +	git commit -m "message" &&
> >> +	git log -Ga >result &&
> >> +	test_must_be_empty result
> >> +'
> >
> > As this is the first mention of data.bin, this is adding a new file
> > data.bin that has two 'a' but is a binary file.  And that is the
> > only commit in the history leading to orphan1.
> >
> > The fact that "log -Ga" won't find any means it missed the creation
> > event, because the blob is binary.  Good.
> 
> By the way, this root commit records another file whose path is
> "file" and has "Picked<LF>" in it.  If the file had 'a' in it, it
> would have been included in "git log" output, but that is too subtle
> a point to be noticed by the readers who are only reading this patch
> without seeing what has been done to the index before this test
> piece.
> 
> If you are going to restructure these tests to create a three-commit
> history in a single expect_success that is inspected with various
> "log -Ga" invocations in subsequent tests, it is worth removing that
> other file (or rather, starting with "read-tree --empty" immediately
> after checking out the orphan branch, to clarify to the readers that
> there is nothing but what you add in the set-up step in the index)
> to make the test more robust.

Thanks for the explanation. At first I thought that "checkout --orphan"
already takes care of everything, but "read-tree --empty" is the way to go.

Done.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3] log -G: ignore binary files
  2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
  2018-11-28 11:32       ` [PATCH v2] log -G: Ignore " Thomas Braun
@ 2018-12-14 18:49       ` Thomas Braun
  2018-12-26 23:24         ` Junio C Hamano
  1 sibling, 1 reply; 30+ messages in thread
From: Thomas Braun @ 2018-12-14 18:49 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sbeller, avarab

The -G<regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

Currently -G also looks into patches of binary files (which,
according to [1], are binary as well).

This has a couple of issues:

- It makes the pickaxe search slow. In a proprietary repository of the
  author with only ~5500 commits and a total .git size of ~300MB
  searching takes ~13 seconds

    $time git log -Gwave > /dev/null

    real    0m13,241s
    user    0m12,596s
    sys     0m0,644s

  whereas when we ignore binary files with this patch it takes ~4s

    $time ~/devel/git/git log -Gwave > /dev/null

    real    0m3,713s
    user    0m3,608s
    sys     0m0,105s

  which is a speedup of roughly 3.5x.

- The internally used algorithm for generating patch text is based on
  xdiff, whose documentation states in [1]

  > The output format of the binary patch file is proprietary
  > (and binary) and it is basically a collection of copy and insert
  > commands [..]

  which means that the current format could change once the internal
  algorithm is changed as the format is not standardized. In addition
  the git binary patch format used for preparing patches for git apply
  is *different* from the xdiff format as can be seen by comparing

  git log -p -a

    commit 6e95bf4bafccf14650d02ab57f3affe669be10cf
    Author: A U Thor <author@example.com>
    Date:   Thu Apr 7 15:14:13 2005 -0700

        modify binary file

    diff --git a/data.bin b/data.bin
    index f414c84..edfeb6f 100644
    --- a/data.bin
    +++ b/data.bin
    @@ -1,2 +1,4 @@
     a
     a^@a
    +a
    +a^@a

  with git log --binary

    commit 6e95bf4bafccf14650d02ab57f3affe669be10cf
    Author: A U Thor <author@example.com>
    Date:   Thu Apr 7 15:14:13 2005 -0700

        modify binary file

    diff --git a/data.bin b/data.bin
    index f414c84bd3aa25fa07836bb1fb73db784635e24b..edfeb6f501[..]
    GIT binary patch
    literal 12
    QcmYe~N@Pgn0zx1O01)N^ZvX%Q

    literal 6
    NcmYe~N@Pgn0ssWg0XP5v

  which seems unexpected.

To resolve these issues this patch makes -G<regex> ignore binary files
by default. Textconv filters are supported and also -a/--text for
getting the old and broken behaviour back.

The -S<block of text> option of log looks for differences that change
the number of occurrences of the specified block of text (i.e.
addition/deletion) in a file. As we want to keep the current behaviour,
add a test to ensure it stays that way.

[1]: http://www.xmailserver.org/xdiff.html

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---

Changes since v2:
 - Introduce a setup step for the new tests 
 - Really start with a clean history in the tests
 - Added more complex commit history for the tests
 - Use test_when_finished for cleanup instead of doing nothing
 - Enhanced commit message to motivate the change better
 - Added some more documentation

 Documentation/diff-options.txt |  5 +++++
 Documentation/gitdiffcore.txt  |  3 ++-
 diffcore-pickaxe.c             |  6 ++++++
 t/t4209-log-pickaxe.sh         | 35 ++++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 0378cd574e..b94d332f71 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -524,6 +524,8 @@ struct), and want to know the history of that block since it first
 came into being: use the feature iteratively to feed the interesting
 block in the preimage back into `-S`, and keep going until you get the
 very first version of the block.
++
+Binary files are searched as well.
 
 -G<regex>::
 	Look for differences whose patch text contains added/removed
@@ -543,6 +545,9 @@ While `git log -G"regexec\(regexp"` will show this commit, `git log
 -S"regexec\(regexp" --pickaxe-regex` will not (because the number of
 occurrences of that string did not change).
 +
+Unless `--text` is supplied patches of binary files without a textconv
+filter will be ignored.
++
 See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
 information.
 
diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..c970d9fe43 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,8 @@ textual diff has an added or a deleted line that matches the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  To speed things up binary files without textconv filters
+will be ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..4cea086f80 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
 	if (textconv_one == textconv_two && diff_unmodified_pair(p))
 		return 0;
 
+	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+	    !o->flags.text &&
+	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
+		return 0;
+
 	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
 	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
 
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..5d06f5f45e 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,39 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '
 
+test_expect_success 'setup log -[GS] binary & --text' '
+	git checkout --orphan GS-binary-and-text &&
+	git read-tree --empty &&
+	printf "a\na\0a\n" >data.bin &&
+	git add data.bin &&
+	git commit -m "create binary file" data.bin &&
+	printf "a\na\0a\n" >>data.bin &&
+	git commit -m "modify binary file" data.bin &&
+	git rm data.bin &&
+	git commit -m "delete binary file" data.bin &&
+	git log >full-log
+'
+
+test_expect_success 'log -G ignores binary files' '
+	git log -Ga >log &&
+	test_must_be_empty log
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git log -a -Ga >log &&
+	test_cmp log full-log
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	test_when_finished "rm .gitattributes" &&
+	echo "* diff=bin" >.gitattributes &&
+	git -c diff.bin.textconv=cat log -Ga >log &&
+	test_cmp log full-log
+'
+
+test_expect_success 'log -S looks into binary files' '
+	git log -Sa >log &&
+	test_cmp log full-log
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] log -G: ignore binary files
  2018-12-14 18:49       ` [PATCH v3] log -G: ignore " Thomas Braun
@ 2018-12-26 23:24         ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-26 23:24 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git, peff, sbeller, avarab

Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> - The internally used algorithm for generating patch text is based on
>   xdiff and its states in [1]
>
>   > The output format of the binary patch file is proprietary
>   > (and binary) and it is basically a collection of copy and insert
>   > commands [..]
>
>   which means that the current format could change once the internal
>   algorithm is changed as the format is not standardized. In addition
>   the git binary patch format used for preparing patches for git apply
>   is *different* from the xdiff format as can be seen by comparing

This particular argument sounds like a red herring.  After all, when
the --text option is given

>
>   git log -p -a
>
>     commit 6e95bf4bafccf14650d02ab57f3affe669be10cf
>     Author: A U Thor <author@example.com>
>     Date:   Thu Apr 7 15:14:13 2005 -0700
>
>         modify binary file
>
>     diff --git a/data.bin b/data.bin
>     index f414c84..edfeb6f 100644
>     --- a/data.bin
>     +++ b/data.bin
>     @@ -1,2 +1,4 @@
>      a
>      a^@a
>     +a
>     +a^@a

we will see 'a' in the output no matter how xdiff internally works;
there is no way to express the above change textually without
showing "+a" somewhere in the patch output.
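
To illustrate the point, here is a sketch that prints the textual patch for such a change and counts the added lines beginning with "a". It assumes a scratch repository and a grep that accepts -a (GNU and BSD grep do); it only approximates what -G looks at and is not the pickaxe implementation itself:

```shell
#!/bin/sh
# Sketch only: with --text the patch is an ordinary line-based diff, so
# the added lines necessarily appear with a leading "+".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "A U Thor"
git config user.email "author@example.com"

printf "a\na\0a\n" >data.bin
git add data.bin
git commit -q -m "create binary file"

printf "a\na\0a\n" >>data.bin
git commit -q -m "modify binary file" data.bin

# grep -a treats the NUL-containing patch as text; -c counts matching lines.
added=$(git diff --text HEAD~1 HEAD -- data.bin | grep -a -c '^+a')
echo "added=$added"
```

Both appended lines show up as "+" lines starting with "a", however the diff machinery chooses to align the hunk.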

The rest of the log message looks good, and ...

> Changes since v2:
>  - Introduce a setup step for the new tests 
>  - Really start with a clean history in the tests
>  - Added more complex commit history for the tests
>  - Use test_when_finished for cleanup instead of doing nothing
>  - Enhanced commit message to motivate the change better
>  - Added some more documentation

... the tests are certainly a lot easier to follow.

Thanks.


Thread overview: 30+ messages
2018-11-21 20:52 [PATCH 0/2] Teach log -G to ignore binary files Thomas Braun
2018-11-21 20:52 ` [PATCH v1 1/2] log -G: Ignore " Thomas Braun
2018-11-21 20:52   ` [PATCH v1 2/2] log -S: Add test which searches in " Thomas Braun
2018-11-21 21:00     ` [PATCH 0/2] Teach log -G to ignore " Thomas Braun
2018-11-28 11:32       ` [PATCH v2] log -G: Ignore " Thomas Braun
2018-11-28 12:54         ` Ævar Arnfjörð Bjarmason
2018-12-14 18:44           ` Thomas Braun
2018-11-29  7:10         ` Junio C Hamano
2018-11-29  7:22           ` Junio C Hamano
2018-12-14 18:45             ` Thomas Braun
2018-12-14 18:45           ` Thomas Braun
2018-12-14 18:49       ` [PATCH v3] log -G: ignore " Thomas Braun
2018-12-26 23:24         ` Junio C Hamano
2018-11-22  1:34     ` [PATCH v1 2/2] log -S: Add test which searches in " Junio C Hamano
2018-11-28 11:31       ` Thomas Braun
2018-11-22  9:14     ` Ævar Arnfjörð Bjarmason
2018-11-24  2:27       ` Junio C Hamano
2018-11-28 11:31       ` Thomas Braun
2018-11-22  1:29   ` [PATCH v1 1/2] log -G: Ignore " Junio C Hamano
2018-11-28 11:31     ` Thomas Braun
2018-11-22 10:16   ` Ævar Arnfjörð Bjarmason
2018-11-22 16:27     ` Jeff King
2018-11-28 11:31     ` Thomas Braun
2018-11-28 11:31     ` Thomas Braun
2018-11-22 16:20   ` Jeff King
2018-11-24  2:32     ` Junio C Hamano
2018-11-28 11:31     ` Thomas Braun
2018-11-26 20:19   ` Stefan Beller
2018-11-27  0:51     ` Junio C Hamano
2018-11-28 11:31       ` Thomas Braun
