On 2021-06-30 at 17:59:43, Jeff King wrote:
> One complication we faced is that a lot of Git's data is bag-of-bytes,
> not utf8. And json technically requires utf8. I don't remember if we
> simply fudged that and output possibly non-utf8 sequences, or if we
> actually encode them.

I think we just emit invalid UTF-8 in that case, which is a problem.
That's why Git is not well suited to JSON output, and why JSON isn't a
good choice for structured data here.  I'd like us not to do more JSON
in our codebase, since encoding issues make it practically impossible
for users to depend on our output if we do[0].
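
To make the problem concrete, here's a rough sketch of what a consumer
with a strict parser would see.  It uses Python's json module purely as
an illustration; the record and the parse_strict helper are made up for
this example and aren't anything Git emits today.

```python
import json

# Hypothetical record: the file name is Latin-1 "café.txt", so the
# byte 0xE9 is not part of a valid UTF-8 sequence.
raw = b'{"path": "caf\xe9.txt"}'

def parse_strict(data: bytes):
    """Return the parsed object, or None if the bytes cannot be decoded."""
    try:
        return json.loads(data)
    except UnicodeDecodeError:
        return None

# The non-UTF-8 record is rejected before any JSON parsing happens.
rejected = parse_strict(raw)

# The same name encoded as UTF-8 parses without trouble.
accepted = parse_strict(b'{"path": "caf\xc3\xa9.txt"}')
```

A more lenient parser might accept the first record, but then users end
up depending on that leniency.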

We could emit data in a different format, such as YAML, which does have
an encoding for arbitrary byte sequences.  However, in YAML, binary data
is always base64 encoded, which is less readable, although still
interchangeable.  CBOR is also a possibility, although it's not
human-readable at all.
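
As a rough illustration of that readability trade-off, here's a sketch
that base64 encodes a path and writes it the way a YAML binary scalar
would look.  It only uses Python's base64 module; the path and the
"path" key are assumptions for the example, and the line is built by
hand rather than with a YAML library.

```python
import base64

# Hypothetical file name that is not valid UTF-8 (Latin-1 "café.txt").
path = b"caf\xe9.txt"

# Base64 turns the arbitrary bytes into plain ASCII.
encoded = base64.b64encode(path).decode("ascii")

# A hand-built YAML line using the !!binary tag.
line = f"path: !!binary {encoded}"

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
```

The bytes survive the round trip, but nobody reading the output can
tell which file it refers to without decoding it first.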

I'm personally fine with the ad-hoc approach we use now, which is
actually very convenient to script and, in my view, not too terrible to
parse in other tools and languages.  Your mileage may vary, though.

[0] I worked on a codebase for many years that exploited its JSON parser
not requiring UTF-8 and it was a colossal mess that I'd like us not to
repeat.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA