The JSON spec says that a UTF encoding shall be used to encode it as a byte stream (with a strong preference for UTF-8). However, it also allows characters to be encoded in the form of \uXXXX escapes. This poses a problem on the parser implementation side: every parser has to handle two representations of the same character.
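To make the problem concrete, here's a quick Python illustration (mine, not something from the spec): the very same one-character string has at least two valid wire forms, and a conforming parser must accept both.

```python
import json

# The same string encoded two ways on the wire: as a raw UTF-8
# character and as a \uXXXX escape. A parser must accept both.
raw     = '"é"'        # the character itself, two UTF-8 bytes
escaped = '"\\u00e9"'  # six ASCII bytes for the same character

assert json.loads(raw) == json.loads(escaped) == 'é'
```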

I'm writing this as a comment on Tim Bray's "I-JSON", an emerging description of a stricter JSON profile aimed at avoiding interoperability problems with JSON (RFC 7493). So here's my comment/proposal in a nutshell:

  1. Mandate a UTF encoding with a "MUST", not a "SHALL".
  2. Drop the \uXXXX escapes.

Makes sense?

Comments: 7

  1. John Cowan

    The main advantage of the \uxxxx escapes is that when writing JSON by hand (which does happen) you can insert an arbitrary character into it, and you can be sure that you have done so. The string "foo\u200bbar" is much more obviously a seven-character string than "foo​bar" is.

  2. Ivan Sagalaev

    John Cowan: granted. I still think the author's convenience in this case is worth less than the gain to the parser from dropping the escaping: escape handling has a performance price for every parsed JSON document.
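    For what it's worth, John's seven-character example is easy to check (a quick Python verification of my own):

    ```python
    import json

    # U+200B is ZERO WIDTH SPACE: invisible in literal form,
    # unmissable in escaped form.
    s = json.loads('"foo\\u200bbar"')
    assert s == 'foo\u200bbar'
    assert len(s) == 7
    ```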

  3. Cowtowncoder

    Ivan: a parser still must handle "simple" escapes to allow inclusion of backslashes, quotes and control characters within String values, so the simplification from not having to handle Unicode escapes is not all that significant. And although it would slightly help in making canonicalization easier, the simple escapes also cover nice-to-have characters that an encoder may choose to escape optionally.

    Not that I would necessarily be against dropping Unicode escapes; just pointing out that, having written a high-performance JSON parser, I am familiar with the optimizations, and argue that this won't make a big difference either way.

  4. Alexander Batishchev

    Makes sense to me. At least from the perspective you described.

  5. Igor Kalnitsky

    Generally I agree, though there's one benefit from escaping you didn't consider. Escaping allows you to have an ASCII-compatible representation of a JSON document. That matters, don't you think?

    Btw, it would be very interesting to know how you retrieve the encoding from the stream. I mean, the JSON document could be sent in UTF-8 as well as in UTF-16 or UTF-32, so you have to choose the right one for proper decoding, and that doesn't seem trivial to me.

  6. Ivan Sagalaev

    Cowtowncoder:

    parser still must handle "simple" escapes to allow inclusion of backslashes, quotes and control characters within String values, so simplification from not having to handle unicode escapes is not all that significant.

    You're right of course, I didn't take those into account. But even marginal simplification is simplification. The important part is that there's no downside (as far as I can see).
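    To show what "marginal" means here, a toy string unescaper (a sketch of mine, not any real parser; surrogate pairs and malformed input are deliberately ignored). The simple-escape branch has to exist no matter what; dropping \uXXXX removes only one extra branch.

    ```python
    # Escapes every parser must handle regardless of the proposal.
    SIMPLE = {'"': '"', '\\': '\\', '/': '/', 'b': '\b', 'f': '\f',
              'n': '\n', 'r': '\r', 't': '\t'}

    def unescape(s):
        out, i = [], 0
        while i < len(s):
            c = s[i]
            if c != '\\':
                out.append(c); i += 1
                continue
            e = s[i + 1]
            if e in SIMPLE:   # the branch that is needed anyway
                out.append(SIMPLE[e]); i += 2
            elif e == 'u':    # the branch I-JSON could drop
                out.append(chr(int(s[i + 2:i + 6], 16)))
                i += 6
            else:
                raise ValueError('bad escape: \\' + e)
        return ''.join(out)
    ```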

    Igor Kalnitsky:

    Escaping allows you to have an ASCII-compatible representation of JSON document. That matters, don't you think so?

    No, not really. Every modern system can produce UTF of some sort these days.

    Btw, it would be very interesting to know how you retrieve the encoding from the stream.

    In HTTP you have the Content-Type header; in a file system you have system-wide defaults. For other transports you can invent your own way of communicating the encoding out of stream (by simple convention, for example). It's not a problem in practice, really.
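    There's also an in-band heuristic, described in RFC 4627 section 3: since the first two characters of a JSON text were guaranteed to be ASCII, the pattern of zero bytes in the first four octets reveals the UTF flavour. A sketch (BOM handling omitted; short inputs just fall back to UTF-8):

    ```python
    def sniff_encoding(head: bytes) -> str:
        """Guess the UTF flavour of a JSON byte stream from its
        first four bytes, per the RFC 4627 null-pattern heuristic."""
        if len(head) < 4:
            return 'utf-8'
        a, b, c, d = head[:4]
        if a == 0 and b == 0 and c == 0:
            return 'utf-32-be'
        if b == 0 and c == 0 and d == 0:
            return 'utf-32-le'
        if a == 0 and c == 0:
            return 'utf-16-be'
        if b == 0 and d == 0:
            return 'utf-16-le'
        return 'utf-8'
    ```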

  7. Larry West

    As I read RFC 2119, "MUST" and "SHALL" are equivalent:

    1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the definition is an absolute requirement of the specification.

    ... perhaps I'm missing some subtle point when you propose mandating use of "MUST" rather than "SHALL"?

    https://www.ietf.org/rfc/rfc2119.txt
