On 2021-07-01 at 16:00:19, Jeff King wrote:
> On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote:
> 
> > On 2021-06-30 at 17:59:43, Jeff King wrote:
> > > One complication we faced is that a lot of Git's data is bag-of-bytes,
> > > not utf8. And json technically requires utf8. I don't remember if we
> > > simply fudged that and output possibly non-utf8 sequences, or if we
> > > actually encode them.
> > 
> > I think we just emit invalid UTF-8 in that case, which is a problem.
> > That's why Git is not well suited to JSON output and why it isn't a good
> > choice for structured data here.  I'd like us not to do more JSON in our
> > codebase, since it's practically impossible for users to depend on our
> > output if we do that due to encoding issues[0].
> > 
> > We could emit data in a different format, such as YAML, which does have
> > encoding for arbitrary byte sequences.  However, in YAML, binary data is
> > always base64 encoded, which is less readable, although still
> > interchangeable.  CBOR is also a possibility, although it's not human
> > readable at all.
> 
> I don't love the invalid-utf8-in-json thing in general. But I think it
> may be the least-bad solution. I seem to recall that YAML has its own
> complexities, and losing human-readability (even to base64) is a pretty
> big downside. And the tooling for working with json seems more common
> and mature (certainly over something like CBOR, but I think even YAML
> doesn't have anything nearly as nice as jq).

I'm not opposed to JSON as long as we don't write landmines.  We could
URI-encode anything that contains a bag-of-bytes, which lets people have
the niceties of JSON without the breakage when people don't write valid
UTF-8.  Most things will still be human-readable.

We could even have --json be an alias for --json=encoded (URI-encoding)
and also have --json=strict for the situation where you assert
everything is valid UTF-8 and explicitly say you want us to die() if
we see non-UTF-8.  I don't want us to say that something is JSON and
then emit junk, since that's a bad user experience.
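
To make the two modes concrete, here is a rough sketch in Python rather
than Git's C.  Nothing here is existing Git code: render_field() and
to_json() are names made up for illustration, the "path" key is an
arbitrary example field, and the mode strings simply mirror the
hypothetical --json=encoded and --json=strict options above.

```python
import json
from urllib.parse import quote, unquote_to_bytes


def render_field(raw: bytes, mode: str = "encoded") -> str:
    """Turn a bag-of-bytes into a string that is safe to put in JSON.

    Hypothetical helper; the modes mirror the proposed --json=encoded
    and --json=strict options.
    """
    if mode == "encoded":
        # Percent-encode every byte outside the unreserved set (plus "/"),
        # so the result is pure ASCII and fully reversible.
        return quote(raw, safe="/")
    if mode == "strict":
        # The caller asserted valid UTF-8; fail loudly (Git would die()).
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError as err:
            raise ValueError(f"non-UTF-8 data in strict mode: {raw!r}") from err
    raise ValueError(f"unknown mode: {mode!r}")


def to_json(path: bytes, mode: str = "encoded") -> str:
    """Serialize one path as a JSON object (illustrative field name)."""
    return json.dumps({"path": render_field(path, mode)})


if __name__ == "__main__":
    latin1_path = b"caf\xe9.txt"  # not valid UTF-8
    print(to_json(latin1_path))   # always valid JSON
    # A consumer recovers the original bytes exactly.
    encoded = json.loads(to_json(latin1_path))["path"]
    print(unquote_to_bytes(encoded) == latin1_path)
```

A consumer of the encoded form gets the original bytes back by
percent-decoding, and the strict form either emits real UTF-8 or
refuses outright, so neither mode ever produces invalid JSON.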

Ideally, we'd have some generic serializer support for this case, so if
people _do_ want to add YAML or CBOR output, it can be stuffed in.

> Our sloppy json encoding does work correctly if you use utf8 paths, and
> I think we could provide options to cover other common cases (e.g., a
> single option for "assume my paths are latin1"). I think life is hardest
> on somebody writing a script/service which is meant to process arbitrary
> repositories (and isn't in control of the strictness of whatever is
> parsing the json).

I think I'd rather provide general encoding functionality than try to
handle random encodings.  I _do_ want people to be able to do things
like store arbitrary bytes in paths, because many people do use that
functionality for shipping test files that verify their code works
correctly on Unix systems.  I also want us to handle arbitrary bytes
where we've stated that's a thing we support (e.g., in refs).  I _don't_
want to encourage people to use non-UTF-8 text encodings, because I
firmly believe those are obsolete.

So, correct binary data support, yes; non-UTF-8 text, no.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA