git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Parsing trailers
@ 2018-12-23 22:41 William Chargin
  2018-12-24 10:58 ` Christian Couder
  2019-01-03  7:07 ` Jeff King
  0 siblings, 2 replies; 8+ messages in thread
From: William Chargin @ 2018-12-23 22:41 UTC (permalink / raw)
  To: Git Mailing List

I'm interested in parsing the output of `git-interpret-trailers` in a
script. I had hoped that the `--parse` option would make this easy, but
it seems that the `trailer.separators` configuration option is used to
specify both the input format (which separators may indicate a trailer)
and the output format of `git interpret-trailers --parse`. Given that
`trailer.separators` may contain any (non-NUL) characters, including
whitespace, parsing the output is not straightforward.

Here's what I've come up with. The output format is "<tok><sep> <val>",
where "<tok>" and "<val>" have been trimmed and so have no leading or
trailing whitespace, but "<val>" may have internal whitespace while
"<tok>" may not. Thus, the first space character in the output may
correspond to either "<sep>" or the fixed space, but we should be able
to determine which is the case: the first space is immediately followed
by a second space if and only if the first space corresponds to "<sep>".

Assuming that the above analysis is correct, the following procedure
should suffice to safely parse the output:

  - Let `i` be the index of the first space in `s`.
  - If `s[i+1]` is a space, let `sep_pos` be `i`. Otherwise, let
    `sep_pos` be `i - 1`.
  - The substring `s[:sep_pos]` is the token.
  - The character at index `sep_pos` is the separator.
  - The character at index `sep_pos + 1` is the fixed space.
  - The substring `s[sep_pos+2:nl]` is the value, where `nl` is the
    index of the first newline in `s` after `sep_pos`.

(It seems unfortunately complicated when all we want to do is parse the
output of `--parse`, but I don't see a better approach!)

My questions:

  - Is this accurate?
  - Is this algorithm guaranteed to remain correct in future versions of
    Git?
  - Is there a simpler way to extract the token-value pairs from a
    commit message string?

Would appreciate any advice.

Thanks!
WC

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2018-12-23 22:41 Parsing trailers William Chargin
@ 2018-12-24 10:58 ` Christian Couder
  2018-12-24 18:52   ` William Chargin
  2019-01-03  7:07 ` Jeff King
  1 sibling, 1 reply; 8+ messages in thread
From: Christian Couder @ 2018-12-24 10:58 UTC (permalink / raw)
  To: William Chargin; +Cc: Git Mailing List

On Sun, Dec 23, 2018 at 11:44 PM William Chargin <wchargin@gmail.com> wrote:
>
> I'm interested in parsing the output of `git-interpret-trailers` in a
> script. I had hoped that the `--parse` option would make this easy, but
> it seems that the `trailer.separators` configuration option is used to
> specify both the input format (which separators may indicate a trailer)
> and the output format of `git interpret-trailers --parse`.

Yeah, but it can take many characters, not just one.
For example you might want to do something like:

seps=$(git config trailer.separators)
test -z "$seps" && seps=':'
git -c trailer.separators="|$seps" interpret-trailers --parse infile >outfile

So that the output uses '|' as a separator.

> Given that
> `trailer.separators` may contain any (non-NUL) characters, including
> whitespace, parsing the output is not straightforward.

Changing the default separator as shown above, should make it easier
to parse the result.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2018-12-24 10:58 ` Christian Couder
@ 2018-12-24 18:52   ` William Chargin
  2018-12-26  4:33     ` Christian Couder
  0 siblings, 1 reply; 8+ messages in thread
From: William Chargin @ 2018-12-24 18:52 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git Mailing List

Hi Christian: thanks for your reply.

> Changing the default separator as shown above, should make it easier
> to parse the result.

But this actually also changes which lines are considered trailers,
right? If the commit message ends with

    Signed-off-by: one
    Signed-off-by| two

and the user’s `trailer.separators` is set to `:`, then the correct
result should be only `Signed-off-by: one`. But when adding `|` as a
separator, we also see `Signed-off-by: two` in the result.

    $ printf '.\n\nSigned-off-by: one\nSigned-off-by| two\n' |
    > git interpret-trailers --parse
    Signed-off-by: one

    $ printf '.\n\nSigned-off-by: one\nSigned-off-by| two\n' |
    > git -c trailer.separators='|:' interpret-trailers --parse
    Signed-off-by| one
    Signed-off-by| two

Best,
WC

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2018-12-24 18:52   ` William Chargin
@ 2018-12-26  4:33     ` Christian Couder
  2018-12-26 21:30       ` William Chargin
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Couder @ 2018-12-26  4:33 UTC (permalink / raw)
  To: William Chargin; +Cc: Git Mailing List

Hi William,

On Mon, Dec 24, 2018 at 7:52 PM William Chargin <wchargin@gmail.com> wrote:
>
> Hi Christian: thanks for your reply.
>
> > Changing the default separator as shown above, should make it easier
> > to parse the result.
>
> But this actually also changes which lines are considered trailers,
> right?

Yes.

> If the commit message ends with
>
>     Signed-off-by: one
>     Signed-off-by| two
>
> and the user’s `trailer.separators` is set to `:`, then the correct
> result should be only `Signed-off-by: one`. But when adding `|` as a
> separator, we also see `Signed-off-by: two` in the result.

Yeah, but you can perhaps check that the input doesn't contain '|'
before doing the above. If it does contain '|' then you can probably
find another char that it doesn't contain and use that char instead of
'|'.

Another solution would be to develop a trailer.outputseparator config
option, which should not be very difficult.

Best,
Christian.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2018-12-26  4:33     ` Christian Couder
@ 2018-12-26 21:30       ` William Chargin
  0 siblings, 0 replies; 8+ messages in thread
From: William Chargin @ 2018-12-26 21:30 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git Mailing List

> Yeah, but you can perhaps check that the input doesn't contain '|'
> before doing the above. If it does contain '|' then you can probably
> find another char that it doesn't contain and use that char instead of
> '|'.

This sounds true in the usual case, though of course there are
pathological cases of commit messages that use the entire character
set. But it is starting to sound more complicated than the
slightly-tricky whitespace indexing logic from my original message.

> Another solution would be to develop a trailer.outputseparator config
> option, which should not be very difficult.

Yes, something like this would be a nice addition for future versions of
Git. Perhaps simpler would be to add a `-z` option to interpret-trailers
that would change the output format to <tok>NUL<val>NUL or similar.
Maybe I’ll send out a patch if I find some free time. :-)

Best,
WC

On Tue, Dec 25, 2018 at 8:33 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> Hi William,
>
> On Mon, Dec 24, 2018 at 7:52 PM William Chargin <wchargin@gmail.com> wrote:
> >
> > Hi Christian: thanks for your reply.
> >
> > > Changing the default separator as shown above, should make it easier
> > > to parse the result.
> >
> > But this actually also changes which lines are considered trailers,
> > right?
>
> Yes.
>
> > If the commit message ends with
> >
> >     Signed-off-by: one
> >     Signed-off-by| two
> >
> > and the user’s `trailer.separators` is set to `:`, then the correct
> > result should be only `Signed-off-by: one`. But when adding `|` as a
> > separator, we also see `Signed-off-by: two` in the result.
>
> Yeah, but you can perhaps check that the input doesn't contain '|'
> before doing the above. If it does contain '|' then you can probably
> find another char that it doesn't contain and use that char instead of
> '|'.
>
> Another solution would be to develop a trailer.outputseparator config
> option, which should not be very difficult.
>
> Best,
> Christian.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2018-12-23 22:41 Parsing trailers William Chargin
  2018-12-24 10:58 ` Christian Couder
@ 2019-01-03  7:07 ` Jeff King
  2019-01-03  7:43   ` William Chargin
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff King @ 2019-01-03  7:07 UTC (permalink / raw)
  To: William Chargin; +Cc: Git Mailing List

On Sun, Dec 23, 2018 at 02:41:20PM -0800, William Chargin wrote:

> I'm interested in parsing the output of `git-interpret-trailers` in a
> script. I had hoped that the `--parse` option would make this easy, but
> it seems that the `trailer.separators` configuration option is used to
> specify both the input format (which separators may indicate a trailer)
> and the output format of `git interpret-trailers --parse`. Given that
> `trailer.separators` may contain any (non-NUL) characters, including
> whitespace, parsing the output is not straightforward.

IMHO this is a bug in --parse. It was always meant to give sane,
normalized output, that you could parse with something like:

  [^:]+: .*

That's what "%(trailers:only)" does (even if the original separator was
something else). It also trims any extra whitespace.

So I think it would be reasonable to:

  1. Add an --output-separator option. This should be uncontroversial.

  2. Make --parse imply "--output-separator=:". This might be more
     controversial, because it does change the output. But as I said
     above, I consider the current behavior to simply be a bug.

  3. (Optional) Add a "-z" option which uses "\0" both
     within and between trailer records. This obviously solves your
     problem without steps 1 and 2, but I think we should have a way to
     do it without relying on NULs, since they're harder to work with in
     some tools.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2019-01-03  7:07 ` Jeff King
@ 2019-01-03  7:43   ` William Chargin
  2019-01-03  7:50     ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: William Chargin @ 2019-01-03  7:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

> That's what "%(trailers:only)" does (even if the original separator was
> something else). It also trims any extra whitespace.

Ooh, this is good to know: thanks. (I had found `print_tok_val` in
`trailer.c` and assumed that this was the only place with the output
logic, but I now see that `format_trailer_info` also exists in the same
file.)

> IMHO this is a bug in --parse. It was always meant to give sane,
> normalized output

Okay; this is good to hear. In that case, what would you think about
changing `interpret-trailers` as a whole to always emit colons? (Note
that the command is already lossy: even with no flags, it replaces each
separator character with the first character of `trailer.separators`.)
This has the advantage that we avoid adding a configuration option of
dubious value—it’s not clear to me why a user would actually _want_ to
change the separator to anything other than a colon. The patch should be
quite simple, too (only tested lightly on my machine):

diff --git a/trailer.c b/trailer.c
index 0796f326b3..722040e48c 100644
--- a/trailer.c
+++ b/trailer.c
@@ -156,7 +156,7 @@ static void print_tok_val(FILE *outfile, const
char *tok, const char *val)
        if (strchr(separators, c))
                fprintf(outfile, "%s%s\n", tok, val);
        else
-               fprintf(outfile, "%s%c %s\n", tok, separators[0], val);
+               fprintf(outfile, "%s: %s\n", tok, val);
 }

Is this veering too much from “bug fix” toward “backward-incompatible
behavior change” for your comfort?

I agree that either this or your (1) and (2) would eliminate the need
for `-z`, which is nice.

Best,
WC

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Parsing trailers
  2019-01-03  7:43   ` William Chargin
@ 2019-01-03  7:50     ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2019-01-03  7:50 UTC (permalink / raw)
  To: William Chargin; +Cc: Git Mailing List

On Wed, Jan 02, 2019 at 11:43:55PM -0800, William Chargin wrote:

> > IMHO this is a bug in --parse. It was always meant to give sane,
> > normalized output
> 
> Okay; this is good to hear. In that case, what would you think about
> changing `interpret-trailers` as a whole to always emit colons? (Note
> that the command is already lossy: even with no flags, it replaces each
> separator character with the first character of `trailer.separators`.)
> This has the advantage that we avoid adding a configuration option of
> dubious value—it’s not clear to me why a user would actually _want_ to
> change the separator to anything other than a colon. The patch should be
> quite simple, too (only tested lightly on my machine):
> 
> diff --git a/trailer.c b/trailer.c
> index 0796f326b3..722040e48c 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -156,7 +156,7 @@ static void print_tok_val(FILE *outfile, const
> char *tok, const char *val)
>         if (strchr(separators, c))
>                 fprintf(outfile, "%s%s\n", tok, val);
>         else
> -               fprintf(outfile, "%s%c %s\n", tok, separators[0], val);
> +               fprintf(outfile, "%s: %s\n", tok, val);
>  }
> 
> Is this veering too much from “bug fix” toward “backward-incompatible
> behavior change” for your comfort?

I think that gets weird in non-parse modes. For example, if I'm not
trying to parse, but rather to _add_ a new trailer (because I'm writing
out a commit message to feed to git-commit-tree), then I'd presumably
want the normal configured separators.

I dunno. I don't think I've ever seen a trailer with a non-colon
separator, nor have I ever used interpret-trailers for anything except
parsing. But it obviously was designed with more flexibility in mind,
and I suspect the patch above has a high chance of breaking something
somewhere.  Tying it to --parse seems a lot safer, since that was
introduced exactly for this case.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-01-03  7:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-23 22:41 Parsing trailers William Chargin
2018-12-24 10:58 ` Christian Couder
2018-12-24 18:52   ` William Chargin
2018-12-26  4:33     ` Christian Couder
2018-12-26 21:30       ` William Chargin
2019-01-03  7:07 ` Jeff King
2019-01-03  7:43   ` William Chargin
2019-01-03  7:50     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).