* Structured (ie: json) output for query commands? [not found] <CACPiFC++fG-WL8uvTkiydf3wD8TY6dStVpuLcKA9cX_EnwoHGA@mail.gmail.com> @ 2021-06-30 17:00 ` Martin Langhoff 2021-06-30 17:59 ` Jeff King 2021-07-01 8:18 ` Han-Wen Nienhuys 0 siblings, 2 replies; 11+ messages in thread From: Martin Langhoff @ 2021-06-30 17:00 UTC (permalink / raw) To: Git Mailing List Hi Git Gang, I'm used to automating git by parsing stuff in Unix shell style. But the golden days that gave us Perl are well behind us. Before I write (unreliable, leaky, likely buggy) text parsing bits one more time, has git grown formatted output? I'm aware of fmt and friends, I'm thinking of something that handles escaping, nested data structures, etc. Something like "git shortlog --json". Have there been discussions? Patches? (I don't spot anything in what's cooking). m -- martin.langhoff@gmail.com - ask interesting questions ~ http://linkedin.com/in/martinlanghoff - don't be distracted ~ http://github.com/martin-langhoff by shiny stuff ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Structured (ie: json) output for query commands? 2021-06-30 17:00 ` Structured (ie: json) output for query commands? Martin Langhoff @ 2021-06-30 17:59 ` Jeff King 2021-06-30 18:20 ` Martin Langhoff 2021-06-30 20:19 ` brian m. carlson 2021-07-01 8:18 ` Han-Wen Nienhuys 1 sibling, 2 replies; 11+ messages in thread From: Jeff King @ 2021-06-30 17:59 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List On Wed, Jun 30, 2021 at 01:00:24PM -0400, Martin Langhoff wrote: > I'm used to automating git by parsing stuff in Unix shell style. But > the golden days that gave us Perl are well behind us. > > Before I write (unreliable, leaky, likely buggy) text parsing bits one > more time, has git grown formatted output? I'm aware of fmt and > friends, I'm thinking of something that handles escaping, nested data > structures, etc. > > Something like "git shortlog --json". > > Have there been discussions? Patches? (I don't spot anything in what's cooking). It's been discussed off-and-on over the years, but I don't think anybody's actively working on it. The trace2 facility has some json output. That's probably not helpful for what you're doing, but it does mean we have basic json output routines available. One complication we faced is that a lot of Git's data is bag-of-bytes, not utf8. And json technically requires utf8. I don't remember if we simply fudged that and output possibly non-utf8 sequences, or if we actually encode them. -Peff ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Structured (ie: json) output for query commands? 2021-06-30 17:59 ` Jeff King @ 2021-06-30 18:20 ` Martin Langhoff 2021-07-01 15:47 ` Jeff King 2021-06-30 20:19 ` brian m. carlson 1 sibling, 1 reply; 11+ messages in thread From: Martin Langhoff @ 2021-06-30 18:20 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List On Wed, Jun 30, 2021 at 1:59 PM Jeff King <peff@peff.net> wrote: > It's been discussed off-and-on over the years, but I don't think > anybody's actively working on it. thanks! > The trace2 facility has some json output. That's probably not helpful > for what you're doing, but it does mean we have basic json output > routines available. Cool. I see json-writer.c; right. > One complication we faced is that a lot of Git's data is bag-of-bytes, Great point -- hadn't thought of that. Don't see anything in json-writer.c but we do use iconv already. cheers, m -- martin.langhoff@gmail.com - ask interesting questions ~ http://linkedin.com/in/martinlanghoff - don't be distracted ~ http://github.com/martin-langhoff by shiny stuff ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Structured (ie: json) output for query commands? 2021-06-30 18:20 ` Martin Langhoff @ 2021-07-01 15:47 ` Jeff King 0 siblings, 0 replies; 11+ messages in thread From: Jeff King @ 2021-07-01 15:47 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List On Wed, Jun 30, 2021 at 02:20:09PM -0400, Martin Langhoff wrote: > > One complication we faced is that a lot of Git's data is bag-of-bytes, > > Great point -- hadn't thought of that. Don't see anything in > json-writer.c but we do use iconv already. We do, but the problem is deeper than that. We don't always know the intended encoding of bytes in the repository. For commits, there's an "encoding" header and we default to utf8 if it's not specified. But filenames in trees do not have an encoding (nor are two entries in a single tree even required to be in the same encoding). They really are just sequences of NUL-terminated binary bytes from Git's perspective. Most of the time that just works, of course. People tend to use utf8 these days anyway. And even if they aren't utf8, as long as the user's terminal is configured to match, then everything will look OK to them (you do have to turn off core.quotepath to see any high-bit characters in filenames). So in practice I suspect it is fine to just output them as-is in json. Things will Just Work for people using utf8 consistently. People using other encodings will have things look OK in their terminal, but probably JSON parsers would choke. We could provide an option to say "when you generate json, assume paths are in encoding XYZ (say, latin1) and convert to utf8". That wouldn't help people who have mix-and-match encodings in their trees, but that seems even more rare. -Peff ^ permalink raw reply [flat|nested] 11+ messages in thread
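Editorial sketch of the "assume paths are in encoding XYZ and convert to utf8" idea described above. This is not Git code; the function name `path_to_json_string` and its `assumed_encoding` parameter are hypothetical, chosen only for illustration. It shows that the same raw bytes are rejected when treated as UTF-8 but become a valid JSON string once the caller declares the encoding.

```python
import json


def path_to_json_string(raw: bytes, assumed_encoding: str = "utf-8") -> str:
    """Decode a raw path using the encoding the caller asserts, then emit JSON.

    Hypothetical helper: a mismatch between the bytes and the asserted
    encoding raises UnicodeDecodeError instead of emitting invalid JSON.
    """
    text = raw.decode(assumed_encoding)
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps({"path": text}, ensure_ascii=False)


# A filename stored as latin1 bytes: 0xE9 alone is not valid UTF-8.
latin1_path = b"caf\xe9.txt"

print(path_to_json_string(latin1_path, "latin-1"))
```

As the message notes, a single declared encoding cannot help a tree that mixes encodings; in this sketch such a tree would either raise on some entries or silently mis-decode them.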
* Re: Structured (ie: json) output for query commands? 2021-06-30 17:59 ` Jeff King 2021-06-30 18:20 ` Martin Langhoff @ 2021-06-30 20:19 ` brian m. carlson 2021-06-30 23:27 ` Martin Langhoff 2021-07-01 16:00 ` Jeff King 1 sibling, 2 replies; 11+ messages in thread From: brian m. carlson @ 2021-06-30 20:19 UTC To: Jeff King; +Cc: Martin Langhoff, Git Mailing List On 2021-06-30 at 17:59:43, Jeff King wrote: > One complication we faced is that a lot of Git's data is bag-of-bytes, > not utf8. And json technically requires utf8. I don't remember if we > simply fudged that and output possibly non-utf8 sequences, or if we > actually encode them. I think we just emit invalid UTF-8 in that case, which is a problem. That's why Git is not well suited to JSON output and why it isn't a good choice for structured data here. I'd like us not to do more JSON in our codebase, since it's practically impossible for users to depend on our output if we do that due to encoding issues[0]. We could emit data in a different format, such as YAML, which does have encoding for arbitrary byte sequences. However, in YAML, binary data is always base64 encoded, which is less readable, although still interchangeable. CBOR is also a possibility, although it's not human readable at all. I'm personally fine with the ad-hoc approach we use now, which is actually very convenient to script and, in my view, not too terrible to parse in other tools and languages. Your mileage may vary, though. [0] I worked on a codebase for many years that exploited its JSON parser not requiring UTF-8 and it was a colossal mess that I'd like us not to repeat. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA
* Re: Structured (ie: json) output for query commands? 2021-06-30 20:19 ` brian m. carlson @ 2021-06-30 23:27 ` Martin Langhoff 2021-07-01 16:00 ` Jeff King 1 sibling, 0 replies; 11+ messages in thread From: Martin Langhoff @ 2021-06-30 23:27 UTC To: brian m. carlson, Jeff King, Martin Langhoff, Git Mailing List On Wed, Jun 30, 2021 at 4:20 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > I'm personally fine with the ad-hoc approach we use now, which is > actually very convenient to script and, in my view, not too terrible to > parse in other tools and languages. Your mileage may vary, though. It's of one piece with the Unix tradition that gave us Perl. That's both a blessing and a curse. Outputting fields, escapes, etc and the ability to use xpath to extract, manipulate and query specific data is... very useful. There's a hard tradeoff with utf-8, and any attempt at papering over that - iconv conversions, etc - will inevitably munge data in some cases. hmmm, m -- martin.langhoff@gmail.com - ask interesting questions ~ http://linkedin.com/in/martinlanghoff - don't be distracted ~ http://github.com/martin-langhoff by shiny stuff
* Re: Structured (ie: json) output for query commands? 2021-06-30 20:19 ` brian m. carlson 2021-06-30 23:27 ` Martin Langhoff @ 2021-07-01 16:00 ` Jeff King 2021-07-01 21:18 ` brian m. carlson 1 sibling, 1 reply; 11+ messages in thread From: Jeff King @ 2021-07-01 16:00 UTC (permalink / raw) To: brian m. carlson; +Cc: Martin Langhoff, Git Mailing List On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote: > On 2021-06-30 at 17:59:43, Jeff King wrote: > > One complication we faced is that a lot of Git's data is bag-of-bytes, > > not utf8. And json technically requires utf8. I don't remember if we > > simply fudged that and output possibly non-utf8 sequences, or if we > > actually encode them. > > I think we just emit invalid UTF-8 in that case, which is a problem. > That's why Git is not well suited to JSON output and why it isn't a good > choice for structured data here. I'd like us not to do more JSON in our > codebase, since it's practically impossible for users to depend on our > output if we do that due to encoding issues[0]. > > We could emit data in a different format, such as YAML, which does have > encoding for arbitrary byte sequences. However, in YAML, binary data is > always base64 encoded, which is less readable, although still > interchangeable. CBOR is also a possibility, although it's not human > readable at all. I don't love the invalid-utf8-in-json thing in general. But I think it may be the least-bad solution. I seem to recall that YAML has its own complexities, and losing human-readability (even to base64) is a pretty big downside. And the tooling for working with json seems more common and mature (certainly over something like CBOR, but I think even YAML doesn't have anything nearly as nice as jq). Our sloppy json encoding does work correctly if you use utf8 paths, and I think we could provide options to cover other common cases (e.g., a single option for "assume my paths are latin1"). 
I think life is hardest on somebody writing a script/service which is meant to process arbitrary repositories (and isn't in control of the strictness of whatever is parsing the json). I'm sensitive to the issue of implementing something that works most of the time, but then fails spectacularly when somebody does something unusual. But it also sucks for many users not to have that "something that works most of the time" if it would make their lives easier. > I'm personally fine with the ad-hoc approach we use now, which is > actually very convenient to script and, in my view, not too terrible to > parse in other tools and languages. Your mileage may vary, though. There are a lot of gotchas there, too. When the data gets complex, "-z" splitting becomes ambiguous (e.g., "git log -z --raw" uses a NUL to separate commits from their diffs, diffs from each other, and diffs from subsequent commits, so you have to pattern-match each type). It's also context-dependent (e.g., you can't parse a "--raw -z" entry without interpreting its type character, since "R" and "C" will have multiple path fields; there are almost certainly a lot of "works most of the time" parsers out there). -Peff
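Editorial sketch of the context-dependence described above, assuming the documented `--raw -z` record layout for non-merge diffs: a `:`-prefixed metadata field ending in the status letter, then one NUL-terminated path, or two for `R` and `C`. The function name `parse_raw_z` is hypothetical, and the input is a hand-written byte string rather than real Git output. Combined (merge) diffs, whose records start with `::`, are deliberately not handled.

```python
def parse_raw_z(data: bytes) -> list:
    """Parse NUL-delimited raw diff records into dicts (paths stay as bytes)."""
    fields = data.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final NUL
    entries = []
    i = 0
    while i < len(fields):
        meta = fields[i]
        if not meta.startswith(b":"):
            raise ValueError(f"expected ':' metadata field, got {meta!r}")
        src_mode, dst_mode, src_oid, dst_oid, status = meta[1:].split(b" ")
        # The status letter decides how many path fields follow:
        # renames and copies carry a source and a destination path.
        n_paths = 2 if status[:1] in (b"R", b"C") else 1
        paths = fields[i + 1 : i + 1 + n_paths]
        if len(paths) != n_paths:
            raise ValueError("truncated record")
        entries.append({"status": status.decode("ascii"), "paths": paths})
        i += 1 + n_paths
    return entries


sample = (
    b":100644 100644 abc1234 def5678 M\0a.txt\0"
    b":100644 100644 abc1234 abc1234 R100\0old.txt\0new.txt\0"
)
for entry in parse_raw_z(sample):
    print(entry)
```

A parser that simply pairs every metadata field with exactly one path would misread `new.txt` as the start of the next record, which is the "works most of the time" failure the message warns about.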
* Re: Structured (ie: json) output for query commands? 2021-07-01 16:00 ` Jeff King @ 2021-07-01 21:18 ` brian m. carlson 2021-07-01 21:48 ` Jeff King 2021-07-02 13:13 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 11+ messages in thread From: brian m. carlson @ 2021-07-01 21:18 UTC (permalink / raw) To: Jeff King; +Cc: Martin Langhoff, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 3340 bytes --] On 2021-07-01 at 16:00:19, Jeff King wrote: > On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote: > > > On 2021-06-30 at 17:59:43, Jeff King wrote: > > > One complication we faced is that a lot of Git's data is bag-of-bytes, > > > not utf8. And json technically requires utf8. I don't remember if we > > > simply fudged that and output possibly non-utf8 sequences, or if we > > > actually encode them. > > > > I think we just emit invalid UTF-8 in that case, which is a problem. > > That's why Git is not well suited to JSON output and why it isn't a good > > choice for structured data here. I'd like us not to do more JSON in our > > codebase, since it's practically impossible for users to depend on our > > output if we do that due to encoding issues[0]. > > > > We could emit data in a different format, such as YAML, which does have > > encoding for arbitrary byte sequences. However, in YAML, binary data is > > always base64 encoded, which is less readable, although still > > interchangeable. CBOR is also a possibility, although it's not human > > readable at all. > > I don't love the invalid-utf8-in-json thing in general. But I think it > may be the least-bad solution. I seem to recall that YAML has its own > complexities, and losing human-readability (even to base64) is a pretty > big downside. And the tooling for working with json seems more common > and mature (certainly over something like CBOR, but I think even YAML > doesn't have anything nearly as nice as jq). I'm not opposed to JSON as long as we don't write landmines. 
We could URI-encode anything that contains a bag-of-bytes, which lets people have the niceties of JSON without the breakage when people don't write valid UTF-8. Most things will still be human-readable. We could even have --json be an alias for --json=encoded (URI-encoding) and also have --json=strict for the situation where you assert everything is valid UTF-8 and explicitly said you wanted us to die() if we saw non-UTF-8. I don't want us to say that something is JSON and then emit junk, since that's a bad user experience. Ideally, we'd have some generic serializer support for this case, so if people _do_ want to add YAML or CBOR output, it can be stuffed in. > Our sloppy json encoding does work correctly if you use utf8 paths, and > I think we could provide options to cover other common cases (e.g., a > single option for "assume my paths are latin1"). I think life is hardest > on somebody writing a script/service which is meant to process arbitrary > repositories (and isn't in control of the strictness of whatever is > parsing the json). I think I'd rather provide a general encoding functionality than try to handle random encodings. I _do_ want people to be able to do things like store arbitrary bytes in paths, because many people do use that functionality for shipping test files that verify their code works correctly on Unix systems. I also want us to handle arbitrary bytes where we've stated that's a thing we support (e.g., in refs). I _don't_ want to encourage people to use non-UTF-8 text encodings, because I firmly believe those are obsolete. So, correct binary data support, yes; non-UTF-8 text, no. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
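Editorial sketch of the two proposed modes. `--json=encoded` and `--json=strict` are proposals from this message, not existing Git options, and the function `emit_path` is hypothetical. It uses Python's standard `urllib.parse.quote_from_bytes` for the percent-encoding and lets `UnicodeDecodeError` stand in for die().

```python
import json
from urllib.parse import quote_from_bytes


def emit_path(raw: bytes, mode: str = "encoded") -> str:
    """Serialize one path as JSON under the proposed 'encoded' or 'strict' mode."""
    if mode == "strict":
        # The caller asserts valid UTF-8; anything else is a hard error.
        text = raw.decode("utf-8")
    elif mode == "encoded":
        # Percent-encode every byte outside the unreserved set, keeping '/'
        # readable. A literal '%' becomes '%25', so decoding is unambiguous.
        text = quote_from_bytes(raw, safe="/")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return json.dumps({"path": text})


print(emit_path(b"dir/caf\xe9.txt"))            # encoded mode, non-UTF-8 byte
print(emit_path(b"dir/plain.txt", "strict"))    # strict mode, valid UTF-8
```

In this sketch ordinary ASCII paths come out unchanged in either mode, which matches the remark that most things stay human-readable.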
* Re: Structured (ie: json) output for query commands? 2021-07-01 21:18 ` brian m. carlson @ 2021-07-01 21:48 ` Jeff King 2021-07-02 13:13 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 11+ messages in thread From: Jeff King @ 2021-07-01 21:48 UTC (permalink / raw) To: brian m. carlson; +Cc: Martin Langhoff, Git Mailing List On Thu, Jul 01, 2021 at 09:18:01PM +0000, brian m. carlson wrote: > > I don't love the invalid-utf8-in-json thing in general. But I think it > > may be the least-bad solution. I seem to recall that YAML has its own > > complexities, and losing human-readability (even to base64) is a pretty > > big downside. And the tooling for working with json seems more common > > and mature (certainly over something like CBOR, but I think even YAML > > doesn't have anything nearly as nice as jq). > > I'm not opposed to JSON as long as we don't write landmines. We could > URI-encode anything that contains a bag-of-bytes, which lets people have > the niceties of JSON without the breakage when people don't write valid > UTF-8. Most things will still be human-readable. > > We could even have --json be an alias for --json=encoded (URI-encoding) > and also have --json=strict for the situation where you assert > everything is valid UTF-8 and explicitly said you wanted us to die() if > we saw non-UTF-8. I don't want us to say that something is JSON and > then emit junk, since that's a bad user experience. > > Ideally, we'd have some generic serializer support for this case, so if > people _do_ want to add YAML or CBOR output, it can be stuffed in. Yep, I'd agree with all of that. I think we're on more-or-less the same page. One annoying thing about JSON is that (to my knowledge) it doesn't have a binary data type. So you have to encode things and shove them into "string". I guess that is not too bad if you are using backslash or percent-encoding, as only a minority of characters get encoded. 
But it sure would be nice for readers if the values, once extracted from the json, could be used without further munging. That's most of the benefit of using json in the first place. But it may be the best we can do. -Peff ^ permalink raw reply [flat|nested] 11+ messages in thread
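Editorial sketch of the "further munging" a reader would face if paths were percent-encoded inside JSON strings. The JSON document below is hand-written for illustration, not real Git output, and `read_path` is a hypothetical name. The point is the extra decode step after the JSON parse.

```python
import json
from urllib.parse import unquote_to_bytes


def read_path(json_text: str) -> bytes:
    """Extract the 'path' field and undo percent-encoding to recover raw bytes."""
    encoded = json.loads(json_text)["path"]   # step 1: ordinary JSON parsing
    return unquote_to_bytes(encoded)          # step 2: the extra munging


raw = read_path('{"path": "caf%E9.txt"}')
print(raw)
```

The result is a byte string rather than text, so a consumer still has to decide how to display or compare it.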
* Re: Structured (ie: json) output for query commands? 2021-07-01 21:18 ` brian m. carlson 2021-07-01 21:48 ` Jeff King @ 2021-07-02 13:13 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 11+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-02 13:13 UTC (permalink / raw) To: brian m. carlson; +Cc: Jeff King, Martin Langhoff, Git Mailing List On Thu, Jul 01 2021, brian m. carlson wrote: > [[PGP Signed Part:Undecided]] > On 2021-07-01 at 16:00:19, Jeff King wrote: >> On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote: >> >> > On 2021-06-30 at 17:59:43, Jeff King wrote: >> > > One complication we faced is that a lot of Git's data is bag-of-bytes, >> > > not utf8. And json technically requires utf8. I don't remember if we >> > > simply fudged that and output possibly non-utf8 sequences, or if we >> > > actually encode them. >> > >> > I think we just emit invalid UTF-8 in that case, which is a problem. >> > That's why Git is not well suited to JSON output and why it isn't a good >> > choice for structured data here. I'd like us not to do more JSON in our >> > codebase, since it's practically impossible for users to depend on our >> > output if we do that due to encoding issues[0]. >> > >> > We could emit data in a different format, such as YAML, which does have >> > encoding for arbitrary byte sequences. However, in YAML, binary data is >> > always base64 encoded, which is less readable, although still >> > interchangeable. CBOR is also a possibility, although it's not human >> > readable at all. >> >> I don't love the invalid-utf8-in-json thing in general. But I think it >> may be the least-bad solution. I seem to recall that YAML has its own >> complexities, and losing human-readability (even to base64) is a pretty >> big downside. And the tooling for working with json seems more common >> and mature (certainly over something like CBOR, but I think even YAML >> doesn't have anything nearly as nice as jq). 
> > I'm not opposed to JSON as long as we don't write landmines. We could > URI-encode anything that contains a bag-of-bytes, which lets people have > the niceties of JSON without the breakage when people don't write valid > UTF-8. Most things will still be human-readable. > > We could even have --json be an alias for --json=encoded (URI-encoding) > and also have --json=strict for the situation where you assert > everything is valid UTF-8 and explicitly said you wanted us to die() if > we saw non-UTF-8. I don't want us to say that something is JSON and > then emit junk, since that's a bad user experience. > > Ideally, we'd have some generic serializer support for this case, so if > people _do_ want to add YAML or CBOR output, it can be stuffed in. I'd think the ideal end-state is for us to have some standardized way to pass structs of structured data around at the C-level. Then everything that now supports a format such as git-log, for-each-ref, cat-file --batch etc. could share the same formatting logic. Our human-readable output would just be a special-case of providing a default format, as it is in the case of some of these commands. If we had a bit of an extension of the %(if) etc. syntax that for-each-ref uses to handle such nested structures we could emit arbitrary structured data, e.g. the formatting language would be sufficient to start a nested structure, only emit commas between elements etc. You could then emit JSON, XML or whatever you'd like with a "simple" (well, it would be quite verbose) format specification. We could then ship some default formats. A related (but not quite the same) benefit would be to make the logic driving the built-ins reentrant, so a formatting feature like this could be combined with a "--batch" mode supported by every (or most) commands. So you could also e.g. run "log --batch" not just "cat-file --batch" if you needed some of the formatting it provides. 
This would be immensely useful to editor implementations and things invoking git on the server, where we often need to pay the startup cost for invoking N number of commands that are built into the "git" binary anyway. So if they could be sent as a --batch ... >> Our sloppy json encoding does work correctly if you use utf8 paths, and >> I think we could provide options to cover other common cases (e.g., a >> single option for "assume my paths are latin1"). I think life is hardest >> on somebody writing a script/service which is meant to process arbitrary >> repositories (and isn't in control of the strictness of whatever is >> parsing the json). > > I think I'd rather provide a general encoding functionality than try to > handle random encodings. I _do_ want people to be able to do things > like store arbitrary bytes in paths, because many people do use that > functionality for shipping test files that verify their code works > correctly on Unix systems. I also want us to handle arbitrary bytes > where we've stated that's a thing we support (e.g., in refs). I _don't_ > want to encourage people to use non-UTF-8 text encodings, because I > firmly believe those are obsolete. > > So, correct binary data support, yes; non-UTF-8 text, no. I don't know how widely they're used with git, but there are several non-Unicode encodings in wide use, and e.g. non-UTF-8 but Unicode encodings like UTF-16 in some contexts/platforms: https://stackoverflow.com/questions/1200063/why-does-anyone-use-an-encoding-other-than-utf-8/2470079
* Re: Structured (ie: json) output for query commands? 2021-06-30 17:00 ` Structured (ie: json) output for query commands? Martin Langhoff 2021-06-30 17:59 ` Jeff King @ 2021-07-01 8:18 ` Han-Wen Nienhuys 1 sibling, 0 replies; 11+ messages in thread From: Han-Wen Nienhuys @ 2021-07-01 8:18 UTC To: Martin Langhoff; +Cc: Git Mailing List On Wed, Jun 30, 2021 at 7:00 PM Martin Langhoff <martin.langhoff@gmail.com> wrote: > I'm used to automating git by parsing stuff in Unix shell style. But > the golden days that gave us Perl are well behind us. > > Before I write (unreliable, leaky, likely buggy) text parsing bits one > more time, has git grown formatted output? I'm aware of fmt and > friends, I'm thinking of something that handles escaping, nested data > structures, etc. > > Something like "git shortlog --json". > > Have there been discussions? Patches? (I don't spot anything in what's cooking). <unpopular opinion> text parsing is fine if you're doing a one-off, but if it's anything that needs to be just moderately complex, robust or long-lived, writing a proper program is a better idea. I'd recommend either go-git (single file programs can easily be run with "go run"), or Dulwich (no experience, though) in python. -- Han-Wen Nienhuys - Google Munich I work 80%. Don't expect answers from me on Fridays. -- Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschäftsführer: Paul Manicle, Halimah DeLaine Prado