* Parsing trailers @ 2018-12-23 22:41 William Chargin 2018-12-24 10:58 ` Christian Couder 2019-01-03 7:07 ` Jeff King 0 siblings, 2 replies; 8+ messages in thread From: William Chargin @ 2018-12-23 22:41 UTC (permalink / raw) To: Git Mailing List I'm interested in parsing the output of `git-interpret-trailers` in a script. I had hoped that the `--parse` option would make this easy, but it seems that the `trailer.separators` configuration option is used to specify both the input format (which separators may indicate a trailer) and the output format of `git interpret-trailers --parse`. Given that `trailer.separators` may contain any (non-NUL) characters, including whitespace, parsing the output is not straightforward. Here's what I've come up with. The output format is "<tok><sep> <val>", where "<tok>" and "<val>" have been trimmed and so have no leading or trailing whitespace, but "<val>" may have internal whitespace while "<tok>" may not. Thus, the first space character in the output may correspond to either "<sep>" or the fixed space, but we should be able to determine which is the case: the first space is immediately followed by a second space if and only if the first space corresponds to "<sep>". Assuming that the above analysis is correct, the following procedure should suffice to safely parse the output: - Let `i` be the index of the first space in `s`. - If `s[i+1]` is a space, let `sep_pos` be `i`. Otherwise, let `sep_pos` be `i - 1`. - The substring `s[:sep_pos]` is the token. - The character at index `sep_pos` is the separator. - The character at index `sep_pos + 1` is the fixed space. - The substring `s[sep_pos+2:nl]` is the value, where `nl` is the index of the first newline in `s` after `sep_pos`. (It seems unfortunately complicated when all we want to do is parse the output of `--parse`, but I don't see a better approach!) My questions: - Is this accurate? - Is this algorithm guaranteed to remain correct in future versions of Git? - Is there a simpler way to extract the token-value pairs from a commit message string? Would appreciate any advice. Thanks! WC ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2018-12-23 22:41 Parsing trailers William Chargin @ 2018-12-24 10:58 ` Christian Couder 2018-12-24 18:52 ` William Chargin 2019-01-03 7:07 ` Jeff King 1 sibling, 1 reply; 8+ messages in thread From: Christian Couder @ 2018-12-24 10:58 UTC (permalink / raw) To: William Chargin; +Cc: Git Mailing List On Sun, Dec 23, 2018 at 11:44 PM William Chargin <wchargin@gmail.com> wrote: > > I'm interested in parsing the output of `git-interpret-trailers` in a > script. I had hoped that the `--parse` option would make this easy, but > it seems that the `trailer.separators` configuration option is used to > specify both the input format (which separators may indicate a trailer) > and the output format of `git interpret-trailers --parse`. Yeah, but it can take many characters, not just one. For example you might want to do something like: seps=$(git config trailer.separators) test -z "$seps" && seps=':' git -c trailer.separators="|$seps" interpret-trailers --parse infile >outfile So that the output uses '|' as a separator. > Given that > `trailer.separators` may contain any (non-NUL) characters, including > whitespace, parsing the output is not straightforward. Changing the default separator as shown above, should make it easier to parse the result. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2018-12-24 10:58 ` Christian Couder @ 2018-12-24 18:52 ` William Chargin 2018-12-26 4:33 ` Christian Couder 0 siblings, 1 reply; 8+ messages in thread From: William Chargin @ 2018-12-24 18:52 UTC (permalink / raw) To: Christian Couder; +Cc: Git Mailing List Hi Christian: thanks for your reply. > Changing the default separator as shown above, should make it easier > to parse the result. But this actually also changes which lines are considered trailers, right? If the commit message ends with Signed-off-by: one Signed-off-by| two and the user’s `trailer.separators` is set to `:`, then the correct result should be only `Signed-off-by: one`. But when adding `|` as a separator, we also see `Signed-off-by: two` in the result. $ printf '.\n\nSigned-off-by: one\nSigned-off-by| two\n' | > git interpret-trailers --parse Signed-off-by: one $ printf '.\n\nSigned-off-by: one\nSigned-off-by| two\n' | > git -c trailer.separators='|:' interpret-trailers --parse Signed-off-by| one Signed-off-by| two Best, WC ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2018-12-24 18:52 ` William Chargin @ 2018-12-26 4:33 ` Christian Couder 2018-12-26 21:30 ` William Chargin 0 siblings, 1 reply; 8+ messages in thread From: Christian Couder @ 2018-12-26 4:33 UTC (permalink / raw) To: William Chargin; +Cc: Git Mailing List Hi William, On Mon, Dec 24, 2018 at 7:52 PM William Chargin <wchargin@gmail.com> wrote: > > Hi Christian: thanks for your reply. > > > Changing the default separator as shown above, should make it easier > > to parse the result. > > But this actually also changes which lines are considered trailers, > right? Yes. > If the commit message ends with > > Signed-off-by: one > Signed-off-by| two > > and the user’s `trailer.separators` is set to `:`, then the correct > result should be only `Signed-off-by: one`. But when adding `|` as a > separator, we also see `Signed-off-by: two` in the result. Yeah, but you can perhaps check that the input doesn't contain '|' before doing the above. If it does contain '|' then you can probably find another char that it doesn't contain and use that char instead of '|'. Another solution would be to develop a trailer.outputseparator config option, which should not be very difficult. Best, Christian. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2018-12-26 4:33 ` Christian Couder @ 2018-12-26 21:30 ` William Chargin 0 siblings, 0 replies; 8+ messages in thread From: William Chargin @ 2018-12-26 21:30 UTC (permalink / raw) To: Christian Couder; +Cc: Git Mailing List > Yeah, but you can perhaps check that the input doesn't contain '|' > before doing the above. If it does contain '|' then you can probably > find another char that it doesn't contain and use that char instead of > '|'. This sounds true in the usual case, though of course there are pathological cases of commit messages that use the entire character set. But it is starting to sound more complicated than the slightly-tricky whitespace indexing logic from my original message. > Another solution would be to develop a trailer.outputseparator config > option, which should not be very difficult. Yes, something like this would be a nice addition for future versions of Git. Perhaps simpler would be to add a `-z` option to interpret-trailers that would change the output format to <tok>NUL<val>NUL or similar. Maybe I’ll send out a patch if I find some free time. :-) Best, WC On Tue, Dec 25, 2018 at 8:33 PM Christian Couder <christian.couder@gmail.com> wrote: > > Hi William, > > On Mon, Dec 24, 2018 at 7:52 PM William Chargin <wchargin@gmail.com> wrote: > > > > Hi Christian: thanks for your reply. > > > > > Changing the default separator as shown above, should make it easier > > > to parse the result. > > > > But this actually also changes which lines are considered trailers, > > right? > > Yes. > > > If the commit message ends with > > > > Signed-off-by: one > > Signed-off-by| two > > > > and the user’s `trailer.separators` is set to `:`, then the correct > > result should be only `Signed-off-by: one`. But when adding `|` as a > > separator, we also see `Signed-off-by: two` in the result. > > Yeah, but you can perhaps check that the input doesn't contain '|' > before doing the above. If it does contain '|' then you can probably > find another char that it doesn't contain and use that char instead of > '|'. > > Another solution would be to develop a trailer.outputseparator config > option, which should not be very difficult. > > Best, > Christian. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2018-12-23 22:41 Parsing trailers William Chargin 2018-12-24 10:58 ` Christian Couder @ 2019-01-03 7:07 ` Jeff King 2019-01-03 7:43 ` William Chargin 1 sibling, 1 reply; 8+ messages in thread From: Jeff King @ 2019-01-03 7:07 UTC (permalink / raw) To: William Chargin; +Cc: Git Mailing List On Sun, Dec 23, 2018 at 02:41:20PM -0800, William Chargin wrote: > I'm interested in parsing the output of `git-interpret-trailers` in a > script. I had hoped that the `--parse` option would make this easy, but > it seems that the `trailer.separators` configuration option is used to > specify both the input format (which separators may indicate a trailer) > and the output format of `git interpret-trailers --parse`. Given that > `trailer.separators` may contain any (non-NUL) characters, including > whitespace, parsing the output is not straightforward. IMHO this is a bug in --parse. It was always meant to give sane, normalized output, that you could parse with something like: [^:]+: .* That's what "%(trailers:only)" does (even if the original separator was something else). It also trims any extra whitespace. So I think it would be reasonable to: 1. Add an --output-separator option. This should be uncontroversial. 2. Make --parse imply "--output-separator=:". This might be more controversial, because it does change the output. But as I said above, I consider the current behavior to simply be a bug. 3. (Optional) Add a "-z" option which uses "\0" both within and between trailer records. This obviously solves your problem without steps 1 and 2, but I think we should have a way to do it without relying on NULs, since they're harder to work with in some tools. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2019-01-03 7:07 ` Jeff King @ 2019-01-03 7:43 ` William Chargin 2019-01-03 7:50 ` Jeff King 0 siblings, 1 reply; 8+ messages in thread From: William Chargin @ 2019-01-03 7:43 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List > That's what "%(trailers:only)" does (even if the original separator was > something else). It also trims any extra whitespace. Ooh, this is good to know: thanks. (I had found `print_tok_val` in `trailer.c` and assumed that this was the only place with the output logic, but I now see that `format_trailer_info` also exists in the same file.) > IMHO this is a bug in --parse. It was always meant to give sane, > normalized output Okay; this is good to hear. In that case, what would you think about changing `interpret-trailers` as a whole to always emit colons? (Note that the command is already lossy: even with no flags, it replaces each separator character with the first character of `trailer.separators`.) This has the advantage that we avoid adding a configuration option of dubious value—it’s not clear to me why a user would actually _want_ to change the separator to anything other than a colon. The patch should be quite simple, too (only tested lightly on my machine): diff --git a/trailer.c b/trailer.c index 0796f326b3..722040e48c 100644 --- a/trailer.c +++ b/trailer.c @@ -156,7 +156,7 @@ static void print_tok_val(FILE *outfile, const char *tok, const char *val) if (strchr(separators, c)) fprintf(outfile, "%s%s\n", tok, val); else - fprintf(outfile, "%s%c %s\n", tok, separators[0], val); + fprintf(outfile, "%s: %s\n", tok, val); } Is this veering too much from “bug fix” toward “backward-incompatible behavior change” for your comfort? I agree that either this or your (1) and (2) would eliminate the need for `-z`, which is nice. Best, WC ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: Parsing trailers 2019-01-03 7:43 ` William Chargin @ 2019-01-03 7:50 ` Jeff King 0 siblings, 0 replies; 8+ messages in thread From: Jeff King @ 2019-01-03 7:50 UTC (permalink / raw) To: William Chargin; +Cc: Git Mailing List On Wed, Jan 02, 2019 at 11:43:55PM -0800, William Chargin wrote: > > IMHO this is a bug in --parse. It was always meant to give sane, > > normalized output > > Okay; this is good to hear. In that case, what would you think about > changing `interpret-trailers` as a whole to always emit colons? (Note > that the command is already lossy: even with no flags, it replaces each > separator character with the first character of `trailer.separators`.) > This has the advantage that we avoid adding a configuration option of > dubious value—it’s not clear to me why a user would actually _want_ to > change the separator to anything other than a colon. The patch should be > quite simple, too (only tested lightly on my machine): > > diff --git a/trailer.c b/trailer.c > index 0796f326b3..722040e48c 100644 > --- a/trailer.c > +++ b/trailer.c > @@ -156,7 +156,7 @@ static void print_tok_val(FILE *outfile, const > char *tok, const char *val) > if (strchr(separators, c)) > fprintf(outfile, "%s%s\n", tok, val); > else > - fprintf(outfile, "%s%c %s\n", tok, separators[0], val); > + fprintf(outfile, "%s: %s\n", tok, val); > } > > Is this veering too much from “bug fix” toward “backward-incompatible > behavior change” for your comfort? I think that gets weird in non-parse modes. For example, if I'm not trying to parse, but rather to _add_ a new trailer (because I'm writing out a commit message to feed to git-commit-tree), then I'd presumably want the normal configured separators. I dunno. I don't think I've ever seen a trailer with a non-colon separator, nor have I ever used interpret-trailers for anything except parsing. But it obviously was designed with more flexibility in mind, and I suspect the patch above has a high chance of breaking something somewhere. Tying it to --parse seems a lot safer, since that was introduced exactly for this case. -Peff ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2019-01-03 7:50 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-12-23 22:41 Parsing trailers William Chargin 2018-12-24 10:58 ` Christian Couder 2018-12-24 18:52 ` William Chargin 2018-12-26 4:33 ` Christian Couder 2018-12-26 21:30 ` William Chargin 2019-01-03 7:07 ` Jeff King 2019-01-03 7:43 ` William Chargin 2019-01-03 7:50 ` Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).