The JSON spec says that a UTF encoding shall be used to serialize it as a byte stream (with a strong preference for UTF-8). However, it also allows characters to be escaped in the form \uXXXX. This poses a problem on the parser implementation side:
You have to decode a byte stream twice: first convert the UTF encoding to Unicode code points, and then replace any \uXXXX sequences with a single code point.
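To make the two steps concrete, here is a minimal sketch (Python assumed, since the post names no language; the function name and details are illustrative, not from the spec) of what a parser has to do for a single string value:

```python
def decode_json_string(raw: bytes) -> str:
    # Step 1: decode the byte stream from its UTF encoding to code points.
    text = raw.decode("utf-8")

    # Step 2: replace any \uXXXX escape sequences with a single code point.
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 5 < len(text) and text[i + 1] == "u":
            cp = int(text[i + 2:i + 6], 16)
            i += 6
            # Escapes outside the BMP are written as UTF-16 surrogate pairs.
            if 0xD800 <= cp <= 0xDBFF and text[i:i + 2] == "\\u" and i + 6 <= len(text):
                low = int(text[i + 2:i + 6], 16)
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 6
            out.append(chr(cp))
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# The escape and the raw UTF-8 bytes must decode to the same character.
assert decode_json_string(b"\\u00e9") == decode_json_string("é".encode("utf-8"))
```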
The \u encoding is redundant, as you can encode all of Unicode in a UTF just fine. The only technical reason I can see for it in the RFC is that a UTF encoding is a SHALL, not a MUST.
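For concreteness, a short check (using Python's json module, as an assumed stand-in for any conforming parser) showing the same string value spelled both ways:

```python
import json

# Both encodings are valid JSON and yield the identical string value,
# which is the redundancy described above.
escaped = b'"\\u00e9"'           # ASCII-only, using the \uXXXX escape
literal = '"é"'.encode("utf-8")  # the same character encoded directly in UTF-8
assert json.loads(escaped) == json.loads(literal) == "é"
```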
Your language runtime probably doesn't make the second step easy. Modern languages that I'm familiar with use distinct data types for UTF-encoded byte sequences and for Unicode characters. So even if your runtime has a built-in codec for \uXXXX escapes, it probably expects a sequence of bytes on input and produces a sequence of Unicode characters on output. But treating your input stream as UTF-encoded first produces those \uXXXX escapes as characters already, not bytes. So you can't use your library codec and have to decode them manually, which is brittle and silly.
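To illustrate the type mismatch, a small sketch (Python assumed, with its 'unicode_escape' codec standing in for the "library codec"): the codec goes from bytes to characters, but after step 1 we already hold characters, and forcing a round trip through bytes corrupts anything that wasn't escaped.

```python
data = b'caf\xc3\xa9 \\u00e9'      # UTF-8 bytes: "café" plus an escaped é
text = data.decode("utf-8")        # step 1: now a str, the escape is still literal

# The naive round trip through the library codec mangles the character that
# was *not* escaped, because 'unicode_escape' reads the bytes as Latin-1.
mangled = text.encode("utf-8").decode("unicode_escape")
assert mangled == "cafÃ© é"        # 'é' became 'Ã©'; only the escape decoded correctly
```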
I'm writing this as a comment on Tim Bray's "I-JSON" (RFC 7493), which is an emerging description of a stricter JSON aimed at avoiding interoperability problems with JSON. So here's my comment/proposal in a nutshell: