git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Difficulty with parsing colorized diff output
@ 2018-12-08  0:09 George King
  2018-12-08  7:16 ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: George King @ 2018-12-08  0:09 UTC (permalink / raw)
  To: Git Mailing List

Hello, I have a rather elaborate diff highlighter that I have implemented as a post-processor to regular git output. I am writing to discuss some difficult aspects of git diff's color output that I am observing with version 2.19.2. This is not a regression report; I am trying to implement a new feature and am stymied by these details.

My goal is to detect SGR color sequences, e.g. '\x1b[32m', that exist in the source text, and have my highlighter print escaped representations of those. For example, I have checked in files that are expected test outputs for tools that emit color codes, and diffs of those get very confusing.

Figuring out which color codes are from the source text and which were added by git is proving very difficult. The obvious solution is to turn git diff coloring off, but as far as I can see this also turns off all coloring for logs, which is undesirable.

Then I tried to remove just the color codes that git adds to the diff. This almost works, but there are some irregularities. Most lines begin with a style/color code and end with a reset code, which would be a perfect indicator that git is using colors. However:

* Context lines do not begin with reset code, but do end with a reset code. It would be preferable in my opinion if they had both (like every other line), or none at all.

* Added lines have excess codes after the plus sign. The entire prefix is, `\x1b[32m+\x1b[m\x1b[32m` translating to GREEN PLUS RESET GREEN. Emitting codes after the plus sign makes the parsing more complex and idiosyncratic.


In summary, I would like to suggest the following improvements:

* Remove the excess codes after the plus sign.

* When git diff is adding colors, ensure that every line begins with an SGR code and ends with the RESET code.

* Add a config feature to turn on log coloring while leaving diff coloring off.


I would be willing to attempt a fix for this myself, but I'd like to hear what the maintainers think first, and would appreciate any hints as to where I should start looking in the code base.


If anyone is curious about the implementation it is called `same-same` and lives here: https://github.com/gwk/pithy/blob/master/pithy/bin/same_same.py

I configure it like this in .gitconfig:

[core]
  pager = same-same | LESSANSIENDCHARS=mK less --RAW-CONTROL-CHARS
[interactive]
  diffFilter = same-same -interactive | LESSANSIENDCHARS=mK less --RAW-CONTROL-CHARS


Thank you,
George


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-08  0:09 Difficulty with parsing colorized diff output George King
@ 2018-12-08  7:16 ` Jeff King
  2018-12-11  3:26   ` Stefan Beller
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2018-12-08  7:16 UTC (permalink / raw)
  To: George King; +Cc: Git Mailing List

On Fri, Dec 07, 2018 at 07:09:58PM -0500, George King wrote:

> My goal is to detect SGR color sequences, e.g. '\x1b[32m', that exist
> in the source text, and have my highlighter print escaped
> representations of those. For example, I have checked in files that
> are expected test outputs for tools that emit color codes, and diffs
> of those get very confusing.
> 
> Figuring out which color codes are from the source text and which were
> added by git is proving very difficult. The obvious solution is to
> turn git diff coloring off, but as far as I can see this also turns
> off all coloring for logs, which is undesirable.

Another gotcha that I don't think you've hit yet: whitespace
highlighting can color arbitrary parts of the line.

Try this, for example:

  printf 'foo\n' >old
  printf '\t \tfoo\n' >new
  git diff --no-index old new

So I'm not sure what you want to do can ever be completely robust. That
said, I'm not opposed to making Git's color output more regular. It
seems like a reasonable cleanup on its own.

(For the Git project itself, we long ago realized that putting raw color
codes into the source is a big pain when working with diffs, and we
instead use tools like t/test-lib-functions.sh's test_decode_color).

> * Context lines do not begin with reset code, but do end with a reset
> code. It would be preferable in my opinion if they had both (like
> every other line), or none at all.

Context lines do have both. It's just that the default color for context
lines is empty. ;)

But yes, I think it might be reasonable to recognize when an opening
color is empty, and turn the closing reset into a noop in that case (or
I guess you are also advocating the opposite: turn an empty color into a
noop \x1b[m or similar).

I think most of the coloring, including context lines, is coming from
diff.c:emit_diff_symbol_from_struct(). Instead of unconditionally
calling:

  context = diff_get_color_opt(o, DIFF_CONTEXT);
  reset = diff_get_color_opt(o, DIFF_RESET);

I suspect we could have a helper like this:

  static const char *diff_get_reset(const char *color)
  {
	if (!color || !*color)
		return "";
	return diff_colors[DIFF_RESET];
  }
  ...
  context = diff_get_color_opt(o, DIFF_CONTEXT);
  reset = diff_get_reset(context);

> * Added lines have excess codes after the plus sign. The entire prefix
> is, `\x1b[32m+\x1b[m\x1b[32m` translating to GREEN PLUS RESET GREEN.
> Emitting codes after the plus sign makes the parsing more complex and
> idiosyncratic.

Yeah, I've run into this one, too (when working on diff-highlight, which
similarly tries to pass-through Git's existing colors).

It's unfortunately not quite trivial, due to some interactions with
whitespace coloring. See this old patch:

  https://public-inbox.org/git/20101109220136.GA5617@sigill.intra.peff.net/

and the followup:

  https://public-inbox.org/git/20101118064050.GA12825@sigill.intra.peff.net/

> * Add a config feature to turn on log coloring while leaving diff coloring off.

Yes, the fact that --pretty log coloring is tied to diffs is mostly a
historical artifact. I think it would be OK to introduce a color.log
config option that defaults to the value of color.diff if unset. That
would be backwards compatible, but allow you to set them independently
if you wanted to.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-08  7:16 ` Jeff King
@ 2018-12-11  3:26   ` Stefan Beller
  2018-12-11 10:17     ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Beller @ 2018-12-11  3:26 UTC (permalink / raw)
  To: Jeff King; +Cc: george.w.king, git

> (For the Git project itself, we long ago realized that putting raw color
> codes into the source is a big pain when working with diffs, and we
> instead use tools like t/test-lib-functions.sh's test_decode_color).

And also we hid the colors behind #defines and such.

> > * Context lines do not begin with reset code, but do end with a reset
> > code. It would be preferable in my opinion if they had both (like
> > every other line), or none at all.
>
> Context lines do have both. It's just that the default color for context
> lines is empty. ;)

The content itself can contain color codes.

Instead of unconditionally resetting each line, we could parse each
content line to determine if we actually have to reset the colors.

>
> But yes, I think it might be reasonable to recognize when an opening
> color is empty, and turn the closing reset into a noop in that case (or
> I guess you are also advocating the opposite: turn an empty color into a
> noop \x1b[m or similar).


>
> I think most of the coloring, including context lines, is coming from
> diff.c:emit_diff_symbol_from_struct(). Instead of unconditionally
> calling:
>
>   context = diff_get_color_opt(o, DIFF_CONTEXT);
>   reset = diff_get_color_opt(o, DIFF_RESET);
>
> I suspect we could have a helper like this:
>
>   static const char *diff_get_reset(const char *color)
>   {
>         if (!color || !*color)
>                 return "";
>         return diff_colors[DIFF_RESET];
>   }
>   ...
>   context = diff_get_color_opt(o, DIFF_CONTEXT);
>   reset = diff_get_reset(context);

Another easier way to do so would be to drop
the line

    needs_reset = 1; /* 'line' may contain color codes. */

in diff.c::emit_line_0
I run the test suite and it passes (I thought we had a test
enforcing we'd reset any user provided coloring).

> > * Added lines have excess codes after the plus sign. The entire prefix
> > is, `\x1b[32m+\x1b[m\x1b[32m` translating to GREEN PLUS RESET GREEN.
> > Emitting codes after the plus sign makes the parsing more complex and
> > idiosyncratic.

Then we have broken code in diff.c::emit_line_ws_markup
in the last case ("else {") which first emits the sign via
emit_line_0 and then the rest via ws_check_emit.
It sounds like we'd want to replace `reset` in the call
of emit_line_0 with
  set == set_sign ? NULL : reset
and set in the call to ws_check_emit with
  set == set_sign ? NULL : set

Manually looking at some diff output  of said diff we'd get
  <RED>- emit_line_0(o, set_sign ? set_sign : set, NULL, !!set_sign,
reset,<RESET>
  <GREEN>+<RESET> <GREEN>emit_line_0(o, set_sign ? set_sign : set,
NULL, !!set_sign, set == set_sign ? NULL : reset,<RESET>

and the issue is that we do not color the beginning white space
of the emission from ws_check_emit but maybe we should.

Another idea would be to allow Git to output its output
as if it was run through test_decode_color, slightly related:
https://public-inbox.org/git/20180804015317.182683-8-sbeller@google.com/
i.e. we'd markup the output instead of coloring it.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11  3:26   ` Stefan Beller
@ 2018-12-11 10:17     ` Jeff King
  2018-12-11 14:47       ` George King
  2018-12-11 16:28       ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2018-12-11 10:17 UTC (permalink / raw)
  To: Stefan Beller; +Cc: george.w.king, git

On Mon, Dec 10, 2018 at 07:26:46PM -0800, Stefan Beller wrote:

> > Context lines do have both. It's just that the default color for context
> > lines is empty. ;)
> 
> The content itself can contain color codes.
> 
> Instead of unconditionally resetting each line, we could parse each
> content line to determine if we actually have to reset the colors.

Good point. I don't recall that being the motivation back when this
behavior started, but it's a nice side effect (and the more recent line
you mentioned in emit_line_0 certainly is doing it intentionally).

That doesn't cover _other_ terminal codes, which could also make for
confusing output, but I do think color codes are somewhat special. We
generally send patches through "less -R", which will pass through the
colors but show escaped versions of other codes.

> Another idea would be to allow Git to output its output
> as if it was run through test_decode_color, slightly related:
> https://public-inbox.org/git/20180804015317.182683-8-sbeller@google.com/
> i.e. we'd markup the output instead of coloring it.

Yeah, I think in the most general form, the problem is that colorizing
(including whitespace highlighting) loses information within a single
line. It would be nice to have a machine-readable format that represents
all the various annotations (like whitespace and coloring moved bits)
that Git computes.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 10:17     ` Jeff King
@ 2018-12-11 14:47       ` George King
  2018-12-11 16:28       ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 10+ messages in thread
From: George King @ 2018-12-11 14:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git

Peff & Stefan, thank you for the feedback. For my purposes, I am content to rely on gitconfig to reduce the colors to something that I can parse without losing information. Since my first email I have found that `wsErrorHighlight = none` gets rid of the problematic extra green highlights in the `+` lines.

I still think that a config to differentiate log coloring from diff coloring would be worthwhile, as it would guarantee that highlighters are getting unadulterated lines from git.

I also think that bracketing every colored line with a color code before the space/plus/minus and a reset just before the newline would be smart, because then a parser can just parse line by line: if there is an SGR sequence before the space/plus/minus, then it would know to strip off the final reset. This is in contrast to how it stands now, where a context line (with no leading color) is ambiguous by itself; I have to remember from previous lines in the hunk that we have been seeing colors in order to know wether I should strip off the reset.

I agree that a machine-readable format would be nice, but regardless it would be useful to make the regular output more parser-friendly.


> On 2018-12-11, at 5:17 AM, Jeff King <peff@peff.net> wrote:
> 
> On Mon, Dec 10, 2018 at 07:26:46PM -0800, Stefan Beller wrote:
> 
>>> Context lines do have both. It's just that the default color for context
>>> lines is empty. ;)
>> 
>> The content itself can contain color codes.
>> 
>> Instead of unconditionally resetting each line, we could parse each
>> content line to determine if we actually have to reset the colors.
> 
> Good point. I don't recall that being the motivation back when this
> behavior started, but it's a nice side effect (and the more recent line
> you mentioned in emit_line_0 certainly is doing it intentionally).
> 
> That doesn't cover _other_ terminal codes, which could also make for
> confusing output, but I do think color codes are somewhat special. We
> generally send patches through "less -R", which will pass through the
> colors but show escaped versions of other codes.
> 
>> Another idea would be to allow Git to output its output
>> as if it was run through test_decode_color, slightly related:
>> https://public-inbox.org/git/20180804015317.182683-8-sbeller@google.com/
>> i.e. we'd markup the output instead of coloring it.
> 
> Yeah, I think in the most general form, the problem is that colorizing
> (including whitespace highlighting) loses information within a single
> line. It would be nice to have a machine-readable format that represents
> all the various annotations (like whitespace and coloring moved bits)
> that Git computes.
> 
> -Peff


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 10:17     ` Jeff King
  2018-12-11 14:47       ` George King
@ 2018-12-11 16:28       ` Ævar Arnfjörð Bjarmason
  2018-12-11 16:41         ` George King
  1 sibling, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-11 16:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, george.w.king, git


On Tue, Dec 11 2018, Jeff King wrote:

> On Mon, Dec 10, 2018 at 07:26:46PM -0800, Stefan Beller wrote:
>
>> > Context lines do have both. It's just that the default color for context
>> > lines is empty. ;)
>>
>> The content itself can contain color codes.
>>
>> Instead of unconditionally resetting each line, we could parse each
>> content line to determine if we actually have to reset the colors.
>
> Good point. I don't recall that being the motivation back when this
> behavior started, but it's a nice side effect (and the more recent line
> you mentioned in emit_line_0 certainly is doing it intentionally).
>
> That doesn't cover _other_ terminal codes, which could also make for
> confusing output, but I do think color codes are somewhat special. We
> generally send patches through "less -R", which will pass through the
> colors but show escaped versions of other codes.

I wonder if optimizing this one way or the other matters for some
terminals. I.e. if we print out some huge diff of thousands of
consecutive "green" added lines is it faster/slower on some of them to
do one "begin green" and "reset" at the end, or is one line at a time
better, or doesn't it matter at all?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 16:28       ` Ævar Arnfjörð Bjarmason
@ 2018-12-11 16:41         ` George King
  2018-12-11 18:55           ` George King
  2018-12-12 12:49           ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: George King @ 2018-12-11 16:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Stefan Beller, git

I first started playing around with terminal colors about 5 years ago, and I recall learning the hard way that Apple Terminal at least behaves very strangely when you have background colors cross line boundaries: background colors disappeared when I scrolled lines back into view. I filed a bug thinking it couldn't be right and Apple closed it as behaving according to compatibility expectations. I never figured out whether they had misunderstood my report or if old terminals were just that crazy. Instead I decided that the safe thing to do was reset after every line. Perhaps some git author reached the same conclusion.

From the perspective of parsing this output, it is really much easier if each line can be understood without considering state of previous lines. If anything, I think it is a safe approach to ensuring that it renders correctly on various terminals as well.

> On 2018-12-11, at 11:28 AM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Tue, Dec 11 2018, Jeff King wrote:
> 
>> On Mon, Dec 10, 2018 at 07:26:46PM -0800, Stefan Beller wrote:
>> 
>>>> Context lines do have both. It's just that the default color for context
>>>> lines is empty. ;)
>>> 
>>> The content itself can contain color codes.
>>> 
>>> Instead of unconditionally resetting each line, we could parse each
>>> content line to determine if we actually have to reset the colors.
>> 
>> Good point. I don't recall that being the motivation back when this
>> behavior started, but it's a nice side effect (and the more recent line
>> you mentioned in emit_line_0 certainly is doing it intentionally).
>> 
>> That doesn't cover _other_ terminal codes, which could also make for
>> confusing output, but I do think color codes are somewhat special. We
>> generally send patches through "less -R", which will pass through the
>> colors but show escaped versions of other codes.
> 
> I wonder if optimizing this one way or the other matters for some
> terminals. I.e. if we print out some huge diff of thousands of
> consecutive "green" added lines is it faster/slower on some of them to
> do one "begin green" and "reset" at the end, or is one line at a time
> better, or doesn't it matter at all?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 16:41         ` George King
@ 2018-12-11 18:55           ` George King
  2018-12-12 13:52             ` Jeff King
  2018-12-12 12:49           ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: George King @ 2018-12-11 18:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Stefan Beller, git

I just noticed that while `wsErrorHighlight = none` fixes the problem of extra green codes for regular diff, it fails to have any effect during interactive `git add -p`.


> On 2018-12-11, at 11:41 AM, George King <george.w.king@gmail.com> wrote:
> 
> I first started playing around with terminal colors about 5 years ago, and I recall learning the hard way that Apple Terminal at least behaves very strangely when you have background colors cross line boundaries: background colors disappeared when I scrolled lines back into view. I filed a bug thinking it couldn't be right and Apple closed it as behaving according to compatibility expectations. I never figured out whether they had misunderstood my report or if old terminals were just that crazy. Instead I decided that the safe thing to do was reset after every line. Perhaps some git author reached the same conclusion.
> 
> From the perspective of parsing this output, it is really much easier if each line can be understood without considering state of previous lines. If anything, I think it is a safe approach to ensuring that it renders correctly on various terminals as well.
> 
>> On 2018-12-11, at 11:28 AM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> 
>> 
>> On Tue, Dec 11 2018, Jeff King wrote:
>> 
>>> On Mon, Dec 10, 2018 at 07:26:46PM -0800, Stefan Beller wrote:
>>> 
>>>>> Context lines do have both. It's just that the default color for context
>>>>> lines is empty. ;)
>>>> 
>>>> The content itself can contain color codes.
>>>> 
>>>> Instead of unconditionally resetting each line, we could parse each
>>>> content line to determine if we actually have to reset the colors.
>>> 
>>> Good point. I don't recall that being the motivation back when this
>>> behavior started, but it's a nice side effect (and the more recent line
>>> you mentioned in emit_line_0 certainly is doing it intentionally).
>>> 
>>> That doesn't cover _other_ terminal codes, which could also make for
>>> confusing output, but I do think color codes are somewhat special. We
>>> generally send patches through "less -R", which will pass through the
>>> colors but show escaped versions of other codes.
>> 
>> I wonder if optimizing this one way or the other matters for some
>> terminals. I.e. if we print out some huge diff of thousands of
>> consecutive "green" added lines is it faster/slower on some of them to
>> do one "begin green" and "reset" at the end, or is one line at a time
>> better, or doesn't it matter at all?
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 16:41         ` George King
  2018-12-11 18:55           ` George King
@ 2018-12-12 12:49           ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2018-12-12 12:49 UTC (permalink / raw)
  To: George King; +Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git

On Tue, Dec 11, 2018 at 11:41:18AM -0500, George King wrote:

> I first started playing around with terminal colors about 5 years ago,
> and I recall learning the hard way that Apple Terminal at least
> behaves very strangely when you have background colors cross line
> boundaries: background colors disappeared when I scrolled lines back
> into view. I filed a bug thinking it couldn't be right and Apple
> closed it as behaving according to compatibility expectations. I never
> figured out whether they had misunderstood my report or if old
> terminals were just that crazy. Instead I decided that the safe thing
> to do was reset after every line. Perhaps some git author reached the
> same conclusion.

Yes, that's exactly it. Certain codes do not do well as they cross
lines. It's been long enough that I don't remember the details, and a
quick grep of the archive found too many results for me to bother wading
through.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Difficulty with parsing colorized diff output
  2018-12-11 18:55           ` George King
@ 2018-12-12 13:52             ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2018-12-12 13:52 UTC (permalink / raw)
  To: George King; +Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git

On Tue, Dec 11, 2018 at 01:55:09PM -0500, George King wrote:

> I just noticed that while `wsErrorHighlight = none` fixes the problem
> of extra green codes for regular diff, it fails to have any effect
> during interactive `git add -p`.

This is due to the way add--interactive invokes diff. It uses the
plumbing commands (diff-tree, diff-files, etc), which do not respect
config which changes the output (since their purpose is to provide
stable machine-readable output).

But add--interactive does a slightly funny thing: it runs each diff
twice. Once to get the machine-readable version, and once to get a
colorized version which it shows to the user. When it does the latter,
it has to manually enable other options (e.g., passing --color as
appropriate, or passing along diff.algorithm).

So the matching behavior here would be for it to look as
wsErrorHighlight and pass it along as a command line option.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-12-12 13:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-08  0:09 Difficulty with parsing colorized diff output George King
2018-12-08  7:16 ` Jeff King
2018-12-11  3:26   ` Stefan Beller
2018-12-11 10:17     ` Jeff King
2018-12-11 14:47       ` George King
2018-12-11 16:28       ` Ævar Arnfjörð Bjarmason
2018-12-11 16:41         ` George King
2018-12-11 18:55           ` George King
2018-12-12 13:52             ` Jeff King
2018-12-12 12:49           ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).