Today marks a milestone: with string unescaping implemented, my JSON parser actually produces entirely correct output! That doesn't necessarily mean it's easy to use or particularly fast yet. But one step at a time :-)
The code came out rather small, here's the whole function (source):
fn unescape(s: &str) -> String {
    // The output never needs more bytes than the input, so allocate once.
    let mut result = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(ch) = chars.next() {
        result.push(
            if ch != '\\' {
                ch
            } else {
                // The character after the backslash selects the escape.
                match chars.next() {
                    Some('u') => {
                        // Fold the next four hex digits into a code point.
                        let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
                        char::from_u32(value).unwrap()
                    }
                    Some('b') => '\x08',
                    Some('f') => '\x0c',
                    Some('n') => '\n',
                    Some('r') => '\r',
                    Some('t') => '\t',
                    // Any other escaped character (\", \\, \/) is copied as is.
                    Some(ch) => ch,
                    _ => panic!("Malformed escape"),
                }
            }
        )
    }
    result
}
Likes
Luckily I can make an educated guess about how much memory my resulting string would occupy and allocate it at once with String::with_capacity(). It works because s.len() gives me the length of a UTF-8 string in bytes, so my output is guaranteed to be no larger than the source, because:
- raw UTF-8 characters are left intact
- \n, \t, etc. are translated into one byte from two
- \uXXXX sequences become UTF-8 sequences of at most 3 bytes, which is less than the original 6 bytes
Look ma, no re-allocations!
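To sanity-check that byte arithmetic, here's a small sketch. The helper names escape_cost and fits_without_realloc are made up for illustration and aren't part of the parser; they only compare byte lengths and check that a buffer sized from the source doesn't have to grow.

```rust
/// Hypothetical helper: (bytes the escape occupies in the source,
/// bytes the decoded character occupies in the output).
fn escape_cost(escape: &str, decoded: char) -> (usize, usize) {
    (escape.len(), decoded.len_utf8())
}

/// Hypothetical helper: true if `decoded` fits into a buffer that was
/// sized from `source.len()` without the buffer having to grow.
fn fits_without_realloc(source: &str, decoded: &str) -> bool {
    let mut out = String::with_capacity(source.len());
    let capacity_before = out.capacity();
    out.push_str(decoded);
    out.capacity() == capacity_before
}
```

For instance, escape_cost("\\n", '\n') gives (2, 1), and a \u escape for the euro sign costs 6 bytes in the source but only 3 in the output.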
Char by char iteration
I seriously don't like having to result.push every single character even for strings containing no \-escapes whatsoever (which is the vast majority of strings in real-world JSON). I'd like to be able to walk through a source string and either a) copy the chunks between \ in bulk or b) if none is found, simply return the source slice, converting it to an owned string with to_owned(). But I wasn't yet able to figure out how to approach that.
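One possible way to approach it, as a rough sketch rather than what the parser actually does: look for backslashes with str::find, copy the text between them with push_str, and return a Cow so the no-escape case can borrow the source instead of copying it. The function name unescape_chunked and the choice of Cow are my assumptions here; error handling is the same panic-on-malformed-input as in the original function.

```rust
use std::borrow::Cow;

// Hypothetical chunk-copying variant of `unescape`.
fn unescape_chunked(s: &str) -> Cow<'_, str> {
    // Fast path: no backslash at all, so borrow the source untouched.
    let first = match s.find('\\') {
        Some(pos) => pos,
        None => return Cow::Borrowed(s),
    };

    let mut result = String::with_capacity(s.len());
    result.push_str(&s[..first]);
    let mut rest = &s[first..];

    // Invariant: at the top of the loop `rest` starts with a backslash.
    while !rest.is_empty() {
        let mut chars = rest[1..].chars();
        let decoded = match chars.next() {
            Some('u') => {
                let value = chars
                    .by_ref()
                    .take(4)
                    .fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
                char::from_u32(value).unwrap()
            }
            Some('b') => '\x08',
            Some('f') => '\x0c',
            Some('n') => '\n',
            Some('r') => '\r',
            Some('t') => '\t',
            Some(ch) => ch,
            None => panic!("Malformed escape"),
        };
        result.push(decoded);

        // Copy everything up to the next backslash in one go.
        rest = chars.as_str();
        match rest.find('\\') {
            Some(pos) => {
                result.push_str(&rest[..pos]);
                rest = &rest[pos..];
            }
            None => {
                result.push_str(rest);
                rest = "";
            }
        }
    }
    Cow::Owned(result)
}
```

A caller that really needs a String can still call .into_owned() on the result, which only allocates in the borrowed case.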
By the way, I find while let Some(ch) = chars.next() rather brilliant! It loops as long as the iterator returns something that can be destructured into a usable value and handily binds the latter to a local var.
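To make that concrete, here's a tiny sketch showing the while let loop next to the loop-plus-match it stands in for. Both function names are invented for this comparison; they do the same thing.

```rust
// The concise form: keep going while `next()` yields `Some(ch)`.
fn collect_with_while_let(s: &str) -> Vec<char> {
    let mut chars = s.chars();
    let mut out = Vec::new();
    while let Some(ch) = chars.next() {
        out.push(ch);
    }
    out
}

// The same loop written out by hand with `loop` and `match`.
fn collect_with_loop_match(s: &str) -> Vec<char> {
    let mut chars = s.chars();
    let mut out = Vec::new();
    loop {
        match chars.next() {
            Some(ch) => out.push(ch),
            None => break,
        }
    }
    out
}
```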
Also, XMPPwocky at #rust IRC channel suggested "to write something on top of a Reader" and "specifically something over a Cursor<Vec<u8>>, actually". Though that was prompted by an entirely different discussion.
Non-obvious .by_ref()
There's this long line in the middle that converts the four hex digits after \u into a corresponding char:
let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
What happened without by_ref() was this line stole ownership of chars from the outer while loop, and Rust didn't let me use chars anywhere else.
If you aren't familiar with the concept of ownership in Rust, head over to the official explanation.
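That was rather surprising because my gut feeling is (or was) that .take(4) is hardly any different than calling .next() four times in a loop, and yet the latter leaves the original iterator alone with its owner. The difference is in the signatures: take() takes self by value, so it consumes the iterator, while next() only borrows it as &mut self.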
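Here's a minimal sketch of by_ref() in isolation. The function split_after_four is a made-up name: it takes four characters through an adapter and then keeps using the same iterator for the remainder, which is exactly what the outer while loop needs.

```rust
// Hypothetical demo: take four chars via an adapter, then keep iterating.
fn split_after_four(s: &str) -> (String, String) {
    let mut chars = s.chars();
    // `take` consumes its receiver. With `by_ref()` the receiver is only a
    // `&mut Chars`, so `chars` itself stays usable afterwards.
    let head: String = chars.by_ref().take(4).collect();
    // Without `by_ref()` above, this line would fail to compile with a
    // "use of moved value" error.
    let tail: String = chars.collect();
    (head, tail)
}
```

Note that take(4) stops after exactly four items, so nothing past them is swallowed from the underlying iterator.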
Hex conversion
You may notice that I convert hex digits into a number manually with .fold() (aka "reduce" in other languages) even though Rust has from_str_radix(16) for that. I used it at first but it needs a separate &str, which I was only able to get by allocating a temporary String. I didn't like an extra allocation so I resorted to the manual way which, frankly, isn't all that bad.
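For comparison, here's a sketch of how from_str_radix could be used without the temporary String, assuming the iterator is a std::str::Chars: its as_str() method exposes the remaining input as a &str that can be sliced directly. Both function names are hypothetical. One caveat I'm guarding against explicitly: u32::from_str_radix also accepts a leading + sign, so the digits are checked first.

```rust
use std::str::Chars;

// The manual way from the post, pulled out into a helper.
fn hex4_with_fold(chars: &mut Chars<'_>) -> u32 {
    chars
        .by_ref()
        .take(4)
        .fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap())
}

// Allocation-free alternative: slice the remaining input, then parse it.
fn hex4_with_radix(chars: &mut Chars<'_>) -> Option<u32> {
    let rest = chars.as_str();
    // `get` returns None if fewer than 4 bytes remain or byte 4 is not
    // on a character boundary.
    let digits = rest.get(..4)?;
    // Reject things like "+123", which `from_str_radix` would accept.
    if !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let value = u32::from_str_radix(digits, 16).ok()?;
    // Advance the caller's iterator past the four digits.
    *chars = rest[4..].chars();
    Some(value)
}
```

Whether this is nicer than the fold is debatable; it mostly trades the unwrap() for an Option the caller has to handle.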
Comments: 2
Does it parse \/ and \"?
Yes, match chars.next() always advances the iterator past one character after \ and Some(ch) => ch, simply copies any unrecognized escape intact.