git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Cc: git@vger.kernel.org, gitster@pobox.com, peff@peff.net,
	sbeller@google.com
Subject: Re: [PATCH v2] log -G: Ignore binary files
Date: Wed, 28 Nov 2018 13:54:42 +0100
Message-ID: <87a7ltz7jh.fsf@evledraar.gmail.com>
In-Reply-To: <c4eac0b0ff0812e5aa8b081e603fc8bdd042ddeb.1543403143.git.thomas.braun@virtuell-zuhause.de>


On Wed, Nov 28 2018, Thomas Braun wrote:

Looks much better this time around.

> The -G<regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> As the concept of patch text only makes sense for text files, we need to
> ignore binary files when searching with -G <regex> as well.
>
> The -S<block of text> option of log looks for differences that change
> the number of occurrences of the specified block of text (i.e.
> addition/deletion) in a file. As we want to keep the current behaviour,
> add a test to ensure it.
> [...]
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.
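
For anyone following along, the -S/-G difference the commit message
describes can be seen with a throwaway repository. This is only a
sketch: the file name, contents and commit messages are made up, and it
uses nothing beyond `git log -S`, `-G` and `--format`:

```shell
# Throwaway repository; every name below is invented for illustration.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

printf 'foo\nbar\nbaz\n' >file.txt
git add file.txt && git commit -q -m first

# Move the "foo" line within the file; its occurrence count stays at 1.
printf 'bar\nbaz\nfoo\n' >file.txt
git commit -q -a -m move

# -S only reports commits that change the number of occurrences.
s_out=$(git log -Sfoo --format=%s)
# -G reports commits whose diff has an added/removed line matching the regex.
g_out=$(git log -Gfoo --format=%s)

printf 'with -S: %s\n' $s_out
printf 'with -G: %s\n' $g_out
```

Here -S should list only "first", since the number of "foo" occurrences
never changes afterwards, while -G should also list "move", because its
diff both removes and adds a line matching "foo".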

Now that we support --text that should be documented. I tried to come up
with something on top:

    diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
    index 0378cd574e..42ae65fb57 100644
    --- a/Documentation/diff-options.txt
    +++ b/Documentation/diff-options.txt
    @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first
     came into being: use the feature iteratively to feed the interesting
     block in the preimage back into `-S`, and keep going until you get the
     very first version of the block.
    ++
    +Unlike `-G` the `-S` option will always search through binary files
    +without a textconv filter. [[TODO: Don't we want to support --no-text
    +then as an optimization?]].

     -G<regex>::
     	Look for differences whose patch text contains added/removed
    @@ -545,6 +549,15 @@ occurrences of that string did not change).
     +
     See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
     information.
    ++
    +Unless `--text` is supplied, binary files without a textconv filter
    +will be ignored.  This was not the case before Git version 2.21.
    ++
    +With `--text`, instead of patch lines we <some example similar to the
    +above diff showing what we actually do for binary files>. [[TODO: How
    +does that work? Could just link to the "diffcore-pickaxe: For
    +Detecting Addition/Deletion of Specified String" section in
    +gitdiffcore(7) which could explain it]]

     --find-object=<object-id>::
     	Look for differences that change the number of occurrences of
    diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
    index c0a60f3158..26880b4149 100644
    --- a/Documentation/gitdiffcore.txt
    +++ b/Documentation/gitdiffcore.txt
    @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept.  This behavior
     is designed to make reviewing changes in the context of the whole
     changeset easier.

    +By default `-G` will ignore binary files without a textconv filter;
    +this can be overridden with `--text` (`-S` always searches them). With
    +`--text` the binary patch we look through is generated as [[TODO: ???]].
    +
     diffcore-order: For Sorting the Output Based on Filenames
     ---------------------------------------------------------

But as you can see from the TODO comments, I don't know exactly how this
works. I *could* dig, but that's my main outstanding problem with this
patch: the commit message / docs aren't being updated to reflect the new
behavior.

I.e. let's leave the docs in a state where the reader knows as
unambiguously what to expect from -G with these binary diffs we've been
implicitly supporting as they do with textual diffs. Ideally with some
examples of how to generate them (re my question about the base85 output
in v1).
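
Something along these lines might do as a starting point for such
examples. It's a sketch under my own assumptions (a throwaway
repository, made-up file name and contents), showing the three
renderings `git diff` can give for a changed binary file. It doesn't by
itself settle which of them -G looks at:

```shell
# Throwaway repository; file name and contents are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

# A NUL byte in the content makes Git classify the file as binary.
printf 'a\0a\nb\0b\n' >data.bin
git add data.bin && git commit -q -m first
printf 'a\0a\nc\0c\n' >data.bin
git commit -q -a -m second

# Default: no patch text at all, just a "Binary files ... differ" notice.
git diff HEAD~1 HEAD >default.out
# --binary: a base85-encoded "GIT binary patch" that git-apply understands.
git diff --binary HEAD~1 HEAD >binary.out
# --text (-a): treat the content as text and emit ordinary +/- lines.
git diff --text HEAD~1 HEAD >text.out
```

Comparing default.out, binary.out and text.out side by side is the kind
of thing the docs could show, next to a statement of which one the
pickaxe options actually search.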

Part of that's obviously behavior we've had all along, but it's much
more convincing to say:

    We are changing X which we've done for ages, it works exactly like
    this, and here's a switch to get it back.

Instead of:

    X doesn't make sense, let's turn it off.

Also the diffcore docs already say stuff about how slow/fast things are,
and in a side-thread you said:

    My main motivation is to speed up "log -G" as that takes a
    considerable amount of time when it wades through MBs of binary
    files which change often.

Makes sense, but then let's say something about that in that section of
the docs.

>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..4cea086f80 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  	if (textconv_one == textconv_two && diff_unmodified_pair(p))
>  		return 0;
>
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    !o->flags.text &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
>  	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
>
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> +	git checkout --orphan orphan1 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with -a' '
> +	git checkout --orphan orphan2 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -a -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'

A large part of the questions I have above, and that future readers
would presumably have, would be answered by these tests using more
realistic test data, i.e. also with \n in there to see whether -G is
also line-based in this binary case.
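
A rough sketch of what I mean, written outside the test suite. The data
and commit names are invented, and I'm only spelling out the outcomes
I'd expect from a textual diff of these two blobs:

```shell
# Throwaway repository; file name and contents are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

# Binary data (NUL bytes) that also contains newlines.
printf 'a\0a\nb\0b\n' >data.bin
git add data.bin && git commit -q -m first
# Only the second "line" changes.
printf 'a\0a\nc\0c\n' >data.bin
git commit -q -a -m second

# With the patch under review this is expected to be empty; a Git
# without it would still look into the binary file.
without_text=$(git log -Gc --format=%s)
# With --text the file is diffed as text, so only "second" touches a
# line containing "c".
with_text=$(git log --text -Gc --format=%s)
# Anchoring the regex checks that matching happens per changed line.
anchored=$(git log --text -G'^c' --format=%s)

printf 'without --text: %s\n' "$without_text"
printf 'with --text: %s\n' "$with_text"
printf 'anchored: %s\n' "$anchored"
```

If -G is line-based here, both --text invocations should print just
"second".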

> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	git checkout --orphan orphan3 &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	git checkout --orphan orphan4 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Sa >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

These tests have way too much repeated boilerplate for no reason. This
could just be (as-is, without the better test data suggested above):

diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..23ed6cc4b1 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,34 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '

+test_expect_success 'setup log -[GS] binary & --text' '
+	git checkout --orphan GS-binary-and-text &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log >full-log
+'
+
+test_expect_success 'log -G ignores binary files' '
+	git log -Ga >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git log -a -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	echo "* diff=bin" >.gitattributes &&
+	git -c diff.bin.textconv=cat log -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -S looks into binary files' '
+	>.gitattributes &&
+	git log -Sa >actual &&
+	test_cmp actual full-log
+'
+
 test_done
