git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: Martin Langhoff <martin.langhoff@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Structured (ie: json) output for query commands?
Date: Thu, 1 Jul 2021 12:00:19 -0400	[thread overview]
Message-ID: <YN3mk0LnyJyuQ+9T@coredump.intra.peff.net> (raw)
In-Reply-To: <YNzR5ZZDTfcN2Q+s@camp.crustytoothpaste.net>

On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote:

> On 2021-06-30 at 17:59:43, Jeff King wrote:
> > One complication we faced is that a lot of Git's data is bag-of-bytes,
> > not utf8. And json technically requires utf8. I don't remember if we
> > simply fudged that and output possibly non-utf8 sequences, or if we
> > actually encode them.
> 
> I think we just emit invalid UTF-8 in that case, which is a problem.
> That's why Git is not well suited to JSON output and why it isn't a good
> choice for structured data here.  I'd like us not to do more JSON in our
> codebase, since it's practically impossible for users to depend on our
> output if we do that due to encoding issues[0].
> 
> We could emit data in a different format, such as YAML, which does have
> encoding for arbitrary byte sequences.  However, in YAML, binary data is
> always base64 encoded, which is less readable, although still
> interchangeable.  CBOR is also a possibility, although it's not human
> readable at all.

I don't love the invalid-utf8-in-json thing in general. But I think it
may be the least-bad solution. I seem to recall that YAML has its own
complexities, and losing human-readability (even to base64) is a pretty
big downside. And the tooling for working with json seems more common
and mature (certainly over something like CBOR, but I think even YAML
doesn't have anything nearly as nice as jq).

Our sloppy json encoding does work correctly if you use utf8 paths, and
I think we could provide options to cover other common cases (e.g., a
single option for "assume my paths are latin1"). I think life is hardest
on somebody writing a script/service which is meant to process arbitrary
repositories (and isn't in control of the strictness of whatever is
parsing the json).

I'm sensitive to the issue of implementing something that works most of
the time, but then fails spectacularly when somebody does something
unusual. But it also sucks for many users not to have that "something
that works most of the time" if it would make their lives easier.

> I'm personally fine with the ad-hoc approach we use now, which is
> actually very convenient to script and, in my view, not to terrible to
> parse in other tools and languages.  Your mileage may vary, though.

There are a lot of gotchas, there, too. When the data gets complex, "-z"
splitting becomes ambiguous (e.g., "git log -z --raw" uses a NUL both to
separate commits from their diffs, diffs from each other, and diffs from
subsequent commits, so you have to pattern-match each type). It's also
context-dependent (e.g., you can't parse a "--raw -z" entry without
interpreting its type character, since "R" and "C" will have multiple
path fields; there are almost certainly a lot of "works most of the
time" parsers out there).

-Peff

  parent reply	other threads:[~2021-07-01 16:00 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CACPiFC++fG-WL8uvTkiydf3wD8TY6dStVpuLcKA9cX_EnwoHGA@mail.gmail.com>
2021-06-30 17:00 ` Structured (ie: json) output for query commands? Martin Langhoff
2021-06-30 17:59   ` Jeff King
2021-06-30 18:20     ` Martin Langhoff
2021-07-01 15:47       ` Jeff King
2021-06-30 20:19     ` brian m. carlson
2021-06-30 23:27       ` Martin Langhoff
2021-07-01 16:00       ` Jeff King [this message]
2021-07-01 21:18         ` brian m. carlson
2021-07-01 21:48           ` Jeff King
2021-07-02 13:13           ` Ævar Arnfjörð Bjarmason
2021-07-01  8:18   ` Han-Wen Nienhuys

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YN3mk0LnyJyuQ+9T@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=martin.langhoff@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).