On 2021-06-30 at 17:59:43, Jeff King wrote:
> One complication we faced is that a lot of Git's data is bag-of-bytes,
> not utf8. And json technically requires utf8. I don't remember if we
> simply fudged that and output possibly non-utf8 sequences, or if we
> actually encode them.

I think we just emit invalid UTF-8 in that case, which is a problem.
That's why Git is not well suited to JSON output and why JSON isn't a
good choice for structured data here. I'd like us not to do more JSON
in our codebase, since it's practically impossible for users to depend
on our output if we do that due to encoding issues[0].

We could emit data in a different format, such as YAML, which does have
an encoding for arbitrary byte sequences. However, in YAML, binary data
is always base64 encoded, which is less readable, although still
interchangeable. CBOR is also a possibility, although it's not human
readable at all.

I'm personally fine with the ad-hoc approach we use now, which is
actually very convenient to script and, in my view, not too terrible to
parse in other tools and languages. Your mileage may vary, though.

[0] I worked on a codebase for many years that exploited its JSON parser
not requiring UTF-8 and it was a colossal mess that I'd like us not to
repeat.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
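
To make the encoding problem above concrete, here is a minimal Python
sketch. It is an illustration only, not anything Git does today: the
sample path, the helper names (is_valid_utf8, encode_for_json,
decode_from_json) and the "path_b64" key are all made up. It shows a
byte string that is not valid UTF-8, and a base64 round trip similar in
spirit to how YAML represents binary data.

```python
import base64
import json

# Hypothetical path as a repository might store it: Latin-1 bytes,
# which are not a valid UTF-8 sequence.
raw_path = b"caf\xe9.txt"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def encode_for_json(data: bytes) -> str:
    """Base64-encode bytes so they fit losslessly in a JSON string."""
    return base64.b64encode(data).decode("ascii")


def decode_from_json(text: str) -> bytes:
    """Reverse encode_for_json and recover the original bytes."""
    return base64.b64decode(text)


# The resulting document is plain ASCII, so it is valid UTF-8 JSON,
# but the path is no longer readable without decoding it first.
doc = json.dumps({"path_b64": encode_for_json(raw_path)})
restored = decode_from_json(json.loads(doc)["path_b64"])
```

The trade-off is the one described above: the bytes survive intact and
the output stays valid, at the cost of readability.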