The JSON spec says that a UTF shall be used to encode JSON text as a byte stream (with a strong preference for UTF-8). However, it also allows characters to be encoded in the form of \uXXXX escapes. This poses a problem on the parser implementation side:
- You have to decode a byte stream twice: first convert a UTF encoding to Unicode code points, and then replace any \uXXXX sequences with a single code point. The \u encoding is redundant, as you can encode all of Unicode in a UTF just fine. The only technical reason I can see for it in the RFC is that a UTF encoding is a SHALL, not a MUST.
- Your language runtime probably doesn't make the second step easy. Modern languages that I'm familiar with use distinct data types for UTF-encoded sequences of bytes and for Unicode characters. So even if your runtime has a built-in codec for \uXXXX escapes, it probably expects a sequence of bytes on input to produce a sequence of Unicode characters on output. But treating your input stream as UTF-encoded first produces those \uXXXX escapes as characters already, not bytes. So you can't use your library codec and have to decode the escapes manually, which is brittle and silly (see the sketch after this list).
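To make the two-step problem concrete, here's a rough Python 3 sketch of what a parser ends up doing. It's an illustration only: the expand_unicode_escapes helper is made up, and it ignores surrogate pairs and all the other escapes a real string scanner handles.

    import re

    raw = b'{"name": "foo\\u200bbar"}'   # bytes off the wire, carrying a literal \u200b escape

    # Step 1: UTF decoding. The escape survives as six ordinary characters.
    text = raw.decode('utf-8')

    # Step 2: the \uXXXX escapes still have to become code points. The tempting
    # shortcut of re-encoding the string and running it through the built-in
    # 'unicode_escape' codec breaks on any non-ASCII characters produced by
    # step 1, so in practice you expand the escapes by hand:
    def expand_unicode_escapes(s: str) -> str:
        return re.sub(r'\\u([0-9a-fA-F]{4})',
                      lambda m: chr(int(m.group(1), 16)),
                      s)

    print(expand_unicode_escapes(text))  # the U+200B is now a real character in the string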
I'm writing this as a comment on Tim Bray's "I-JSON" (RFC 7493), an emerging description of a stricter profile of JSON aimed at avoiding interoperability problems. So here's my comment/proposal in a nutshell:
- Mandate a UTF encoding with a "MUST", not a "SHALL".
- Drop the \uXXXX escapes.
Makes sense?
Comments: 7
The main advantage of the \uxxxx escapes is that when writing JSON by hand (which does happen) you can insert an arbitrary character into it, and you can be sure that you have done so. The string "foo\u200bbar" is much more obviously a seven-character string than "foobar" is.
John Cowan: granted. I still think the author's advantage in this case has less value than the gain to the parser from dropping the escaping. The escaping has a performance price for every parsed JSON document.
Ivan: a parser still must handle "simple" escapes to allow inclusion of backslashes, quotes and control characters within String values, so the simplification from not having to handle Unicode escapes is not all that significant. And although it would slightly help in making canonicalization easier, simple escapes also include nice-to-have characters that an encoder may choose to use optionally.
Not that I would necessarily be against dropping Unicode escapes; I'm just pointing out that, having written a high-performance JSON parser, I am familiar with the optimizations, and I'd argue that this won't make a big difference either way.
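For illustration, here's roughly what that escape handling looks like. The sketch below is mine, not taken from any real parser, and it skips error handling for truncated or malformed input.

    # The "simple" escapes every JSON parser has to support anyway...
    SIMPLE = {'"': '"', '\\': '\\', '/': '/', 'b': '\b', 'f': '\f',
              'n': '\n', 'r': '\r', 't': '\t'}

    def unescape(s: str) -> str:
        out, i = [], 0
        while i < len(s):
            if s[i] != '\\':
                out.append(s[i])
                i += 1
            elif s[i + 1] in SIMPLE:
                out.append(SIMPLE[s[i + 1]])
                i += 2
            elif s[i + 1] == 'u':
                # ...and the extra \uXXXX branch under discussion, including
                # the surrogate-pair case for characters outside the BMP.
                code = int(s[i + 2:i + 6], 16)
                if 0xD800 <= code <= 0xDBFF:
                    low = int(s[i + 8:i + 12], 16)
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)
                    i += 12
                else:
                    i += 6
                out.append(chr(code))
            else:
                raise ValueError('bad escape: \\' + s[i + 1])
        return ''.join(out)

    print(unescape(r'line\nbreak and \uD83D\uDE00'))  # expands to a newline and an emoji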
Makes sense to me. At least from the perspective you described.
Generally I agree, though there's one benefit of escaping you didn't consider. Escaping allows you to have an ASCII-compatible representation of a JSON document. That matters, don't you think?
Btw, I'm very interested to know how you retrieve the encoding from the stream. I mean, the JSON document could be sent in UTF-8 as well as in UTF-16 or UTF-32, so you have to choose the right one for proper decoding, and that doesn't seem trivial to me.
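As an aside on the ASCII-compatibility point: Python's json module, for one, produces ASCII-only output by default precisely by emitting \uXXXX escapes.

    import json

    print(json.dumps({"name": "naïve"}))                      # {"name": "na\u00efve"}
    print(json.dumps({"name": "naïve"}, ensure_ascii=False))  # {"name": "naïve"}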
Cowtowncoder:
You're right of course, I didn't take those into account. But even marginal simplification is simplification. The important part is that there's no downside (as far as I can see).
Igor Kalnitsky:
No, not really. Every modern system can produce UTF of some sort these days.
In HTTP you have the Content-type header; in a file system you have system-wide defaults. For other transports you can invent your own way of communicating the encoding out of stream (by a simple convention, for example). It's not a problem in practice, really.
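A minimal sketch of that out-of-band approach over HTTP, assuming the transport hands you the raw body along with the Content-type header (the parse_json function and its arguments are illustrative):

    import json
    from email.message import Message

    def parse_json(body: bytes, content_type: str) -> object:
        # Pull the charset parameter out of the Content-type header,
        # falling back to UTF-8 by convention when it's absent.
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset('utf-8')
        return json.loads(body.decode(charset))

    print(parse_json(b'{"answer": 42}', 'application/json'))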
As I read RFC 2119, "MUST" and "SHALL" are equivalent: ... perhaps I'm missing some subtle point when you propose mandating the use of "MUST" rather than "SHALL"?
https://www.ietf.org/rfc/rfc2119.txt