git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@jeffhostetler.com, git@vger.kernel.org, gitster@pobox.com,
	lars.schneider@autodesk.com,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH 0/2] routines to generate JSON data
Date: Tue, 20 Mar 2018 01:52:00 -0400
Message-ID: <20180320055200.GD15813@sigill.intra.peff.net>
In-Reply-To: <87tvtfd3sl.fsf@evledraar.gmail.com>

On Sat, Mar 17, 2018 at 12:00:26AM +0100, Ævar Arnfjörð Bjarmason wrote:

> 
> On Fri, Mar 16 2018, Jeff King jotted:
> 
> > I really like the idea of being able to send our machine-readable output
> > in some "standard" syntax for which people may already have parsers. But
> > one big hangup with JSON is that it assumes all strings are UTF-8.
> 
> FWIW It's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
> is valid, not that it changes anything you're pointing out, but people
> on Win32 could use UTF-16 as-is if their filenames were in that format.

But AIUI, non-UTF-8 has to come as "\u" escapes, right? That at least
gives us an "out" for exotic characters, but I don't think we can just
blindly dump pathnames into quoted strings, can we?
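
To make the concern concrete, here is a minimal sketch in Python (the
language and the sample pathname are assumptions for illustration, not
anything from the patches). It shows that a "\u" escape names a Unicode
code point rather than a raw byte, so it cannot stand in for a pathname
byte that is not valid UTF-8:

```python
import json

# Hypothetical pathname bytes: "café" in Latin-1, which is not valid UTF-8.
path_bytes = b"caf\xe9"


def try_utf8(raw):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# A "\u" escape names the code point U+00E9, not the byte 0xE9.
escaped = json.dumps("caf\u00e9")
print(escaped)

# Decoding the JSON and re-encoding as UTF-8 yields two bytes for the "é",
# so the original single-byte pathname is not what comes back.
decoded = json.loads(escaped)
print(decoded.encode("utf-8"))

# The raw pathname cannot even be turned into a string to quote.
print(try_utf8(path_bytes))
```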

> > Some possible solutions I can think of:
> >
> >   1. Ignore the UTF-8 requirement, making a JSON-like output (which I
> >      think is what your patches do). I'm not sure what problems this
> >      might cause on the parsing side.
> 
> Maybe some JSON parsers are more permissive, but they'll commonly just
> die on non-Unicode (usually UTF-8) input, e.g.:
> 
>     $ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') | perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
>     malformed UTF-8 character in JSON string, at character offset 10 (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

OK, that's about what I expected.
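
A rough Python analogue of the experiment above, as a sketch only: it
uses fixed invalid bytes instead of /dev/urandom so the result is
repeatable, and the helper name is made up. It decodes the input as
UTF-8 explicitly before parsing, which is where the failure shows up:

```python
import json


def parse_or_error(raw):
    """Return ("ok", value) or ("error", exception class name)."""
    try:
        # Decode explicitly as UTF-8, then parse the resulting text.
        return ("ok", json.loads(raw.decode("utf-8")))
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueErrors.
        return ("error", type(exc).__name__)


good = b'{"str": "plain ascii"}'
bad = b'{"str": "\xff\xfe\xfd"}'  # three bytes that are never valid UTF-8

print(parse_or_error(good))
print(parse_or_error(bad))
```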

> >   2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
> >      to know the options here, but my understanding is that numeric
> >      escapes are just for inserting unicode code points. _Can_ you
> >      actually transport arbitrary binary data across JSON without
> >      base64-encoding it (yech)?
> 
> There's no way to transfer binary data in JSON without it being shoved
> into a UTF-8 encoding, so you'd need to know on the other side that
> such-and-such a field has binary in it, i.e. you'll need to invent your
> own schema.

Yuck. That's what I was afraid of. Is there any kind of standard scheme
here? It seems like we lose all of the benefits of JSON if the receiver
has to know whether and when to de-base64 (or whatever) our data.
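
As an illustration of what "inventing your own schema" could look like,
here is a hedged sketch in Python. The "enc" and "path" field names and
both helpers are hypothetical, not a standard and not part of the
proposed json-writer. The point is that the receiver has to know the
convention to get the original bytes back:

```python
import base64
import json


def encode_path(raw):
    """Hypothetical convention: tag each path with how it was encoded."""
    try:
        return {"enc": "utf-8", "path": raw.decode("utf-8")}
    except UnicodeDecodeError:
        # Fall back to base64 for bytes that are not valid UTF-8.
        return {"enc": "base64", "path": base64.b64encode(raw).decode("ascii")}


def decode_path(obj):
    """Inverse of encode_path; the receiver must know about the "enc" tag."""
    if obj["enc"] == "base64":
        return base64.b64decode(obj["path"])
    return obj["path"].encode("utf-8")


paths = [b"README.md", b"caf\xe9.txt"]
wire = json.dumps([encode_path(p) for p in paths])
print(wire)

roundtrip = [decode_path(o) for o in json.loads(wire)]
print(roundtrip)
```

A generic JSON consumer that only reads "path" would silently get the
base64 text for the second entry, which is exactly the loss of benefit
described above.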

> I think for git's use-case we're probably best off with JSON. It's going
> to work almost all of the time, and when it doesn't it's going to be on
> someone's weird non-UTF-8 repo, and those people are probably used to
> dealing with crap because of that anyway and can just manually decode
> their thing after it gets double-encoded.

That sounds a bit hand-wavy. While I agree that anybody using non-UTF-8
at this point is slightly insane, Git _does_ actually work with
arbitrary encodings in things like pathnames. It just seems kind of lame
to settle on a new universal encoding format for output that's actually
less capable than the current output.
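
For comparison, a sketch of why the existing output is more capable:
Git's NUL-terminated modes (e.g. "git ls-files -z") pass path bytes
through untouched, so a consumer can keep them as bytes. The sample
data and the helper name below are made up for illustration, and no
git command is actually run:

```python
def split_nul_records(raw):
    """Split NUL-terminated records, keeping each one as raw bytes."""
    return [rec for rec in raw.split(b"\0") if rec]


# Hypothetical "-z"-style output: two paths, the second not valid UTF-8.
sample = b"README.md\0caf\xe9.txt\0"

paths = split_nul_records(sample)
print(paths)
```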

> That sucks, but given that we'll be using this either for just ASCII
> (telemetry) or UTF-8 most of the time, and that realistically other
> formats either suck more or aren't nearly as ubiquitous...

I'd hoped to be able to output something like "git status" in JSON,
which is inherently going to deal with user paths.

-Peff
