The JSON spec says that a UTF shall be used to encode it as a byte stream (with a strong preference for UTF-8). However, it also allows characters to be encoded in the form of \uXXXX escapes. This poses a problem on the parser implementation side:
- You have to decode a byte stream twice: first convert the UTF encoding to Unicode code points, and then replace any \uXXXX sequences with a single code point.
- The \u encoding is redundant, as you can encode all of Unicode in a UTF just fine. The only technical reason I can see in the RFC for it is that a UTF encoding is a SHALL, not a MUST.
- Your language runtime probably doesn't make the second step easy. Modern languages that I'm familiar with use distinct data types for UTF-encoded sequences of bytes and for Unicode characters. So even if your runtime has a built-in codec for \uXXXX escapes, it probably expects a sequence of bytes on input and produces a sequence of Unicode characters on output. But treating your input stream first as UTF-encoded gives you those \uXXXX sequences as characters already, not bytes. So you can't use your library codec and have to decode them manually, which is brittle and silly (the sketch after this list makes that concrete).
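To make the brittleness concrete, here's a minimal sketch of the two-pass decoding problem. Python is used purely for illustration; the codec names ("utf-8", "latin-1", "unicode_escape") are Python's own and have nothing to do with the JSON spec itself:

```python
# JSON as a UTF-8 byte stream: one character escaped, one encoded directly.
raw = '{"a": "\\u00e9", "b": "日"}'.encode("utf-8")

# Pass 1: UTF decode -> Unicode characters. The \u00e9 escape is now six
# ordinary characters, and "日" is a real character.
text = raw.decode("utf-8")

# Pass 2: resolving the escapes with the runtime's codec ("unicode_escape")
# requires bytes on input, so you have to re-encode first -- and that
# round-trip goes through Latin-1, destroying any non-ASCII character that
# was encoded directly rather than escaped:
mangled = text.encode("latin-1", errors="replace").decode("unicode_escape")
print(mangled)   # {"a": "é", "b": "?"}  -- the escape resolved, "日" is gone

# So in practice you end up scanning for \uXXXX (and surrogate pairs) by hand.
```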
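And a tiny illustration of the redundancy point from the list above: once the bytes are in a UTF, the escaped form and the directly encoded form carry exactly the same information (again Python, again illustrative only):

```python
import json

escaped = b'"\\u00e9"'           # "é" written as a \uXXXX escape
direct  = '"é"'.encode("utf-8")  # "é" encoded directly as UTF-8 bytes

# Both byte sequences decode to the same string.
assert json.loads(escaped) == json.loads(direct) == "é"
```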
I'm writing this as a comment on Tim Bray's "I-JSON" (RFC 7493), an emerging description of a stricter JSON profile aimed at avoiding interoperability problems with JSON. So here's my comment/proposal in a nutshell:
- Mandate a UTF encoding with a "MUST", not a "SHALL".
- Drop the