
Shoppy

2026-05-18T21:29:49.095331+00:00

Meet Shoppy! It's a helper app for my recently revived shopping list, with which I'm hoping to grow the dataset for categories prediction. In fact, even early beta tests have made Shoppy significantly more savvy about alcoholic drinks (the initial data comes from my own shopping, and my entire family happens to be non-drinkers). See if you can confuse it about something it doesn't know!

But besides that, there's a few deeper philosophical and technical notes I wanted to share.

The app

It's a very, very simple Django app. When I first had the idea to build it I entertained some thoughts about trying some front-end based technology, because, you know, it's an "app"… But then after actually thinking about what it's going to be — a handful of static screens and a couple of forms — I decided to go the familiar way.

Now I have a small, view-source'able HTML app which I'm proud to offer as an example of how you can build something interactive without the layers of modern front-end technology.

If you're new here, simplicity is kind of my thing in software engineering. Although it's really hard to convince people to do simple.

CSS

Trying modern CSS after a long break felt really exciting! Nested blocks, variables, complete control over the box model, new useful units (like vw), and niceties like width: fit-content — all of these made my life much simpler.

I was especially impressed with border-image which allowed me to make speech and form bubbles flexible. Without it, trying to make text of variable length look nice in a fixed-size bubble caused me a lot of frustration.

For layout, I tried flexbox and grid, but they didn't really work for me. It's my own fault, really. You see, ever since I bought into the idea of separating the roles of markup and style, I dislike adding extra structure to markup purely for styling convenience. Markup needs to mean something!

And the one thing that grids and flexboxes really like is having straightforward container <div>s with stuff inside of them. But what I have is a <body> which consists of naked <img>, <p>, <form> and <footer>, in this order — and that's just not enough structure to say "this goes here, and that goes there".

So I ended up with good old absolute positioning and some paddings around Shoppy's avatar. CSS variables really do shine for things like this.

And! It was my first time making a responsive layout that looks nice both on mobile and desktop! Tell me if something is broken on your particular setup.

Model

The model is a mapping from "terms" to categories. I learned to build such things while working on the Search team at Shutterstock, and their simplicity still amazes me!

Here's how it works:

You get a search query, like "Honeycrisp apples".
You split it into words, stem them and sort them, which gives you ["appl", "honeycrisp"] — a predictable set of keys independent of morphology and the input order (they're called unigrams).
Then you generate all two-word combinations (called bigrams) from this set, which in this case gives you just ["appl honeycrisp"], and add them to unigrams.
And then you look up each of the search terms in the dataset and pick the entry that comes the earliest. In this case, there's only one: appl,produce.

That's it!

But there's a few non-obvious tricks it lets you do:

You don't need to list all the apple varieties, unknown words are simply ignored, and you just recognize any apple as produce.
But what of "apple juice"? For that it has an entry juic,drink, which is deliberately placed before the apples, so it gets picked up instead. In fact, what it means is that "any kind of juice is a drink, regardless of what it's made of". Same goes for "oat milk" (drink), "diced tomatoes" (canned products), etc.
Now think of "apple sauce". "Apple" is produce, "sauce" is (usually) a condiment. But "apple sauce" is a snack! This is where bigrams come into play: the bigram entry appl sauc,snack comes before both sauc and appl, which resolves the conundrum. (In fact, all of the bigrams must come before all the unigrams, because they're always more specific.)
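To make the lookup concrete, here's a minimal sketch of it in Rust. It is not Shoppy's actual code (which lives in a Django app): the toy stem function, the DATASET contents and all the function names are assumptions made up for illustration, with entries taken from the examples above.

```rust
/// Toy stemmer, an assumption for illustration only: it strips a few
/// English suffixes. A real app would use a proper stemming library.
fn stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["es", "s", "e"].iter() {
        if let Some(base) = w.strip_suffix(*suffix) {
            if base.len() >= 3 {
                return base.to_string();
            }
        }
    }
    w
}

/// Turns a query into search terms: all two-word combinations (bigrams)
/// of the sorted stems, followed by the stems themselves (unigrams).
fn search_terms(query: &str) -> Vec<String> {
    let mut unigrams: Vec<String> = query.split_whitespace().map(stem).collect();
    unigrams.sort();
    unigrams.dedup();
    let mut terms = Vec::new();
    for i in 0..unigrams.len() {
        for j in (i + 1)..unigrams.len() {
            terms.push(format!("{} {}", unigrams[i], unigrams[j]));
        }
    }
    terms.extend(unigrams);
    terms
}

/// Hypothetical dataset. Order matters: bigrams first, then "juic"
/// deliberately placed before "appl".
const DATASET: &[(&str, &str)] = &[
    ("appl sauc", "snack"),
    ("juic", "drink"),
    ("sauc", "condiment"),
    ("appl", "produce"),
];

/// Picks the category of the earliest dataset entry matching any term.
fn predict<'a>(dataset: &[(&str, &'a str)], query: &str) -> Option<&'a str> {
    let terms = search_terms(query);
    dataset
        .iter()
        .find(|(key, _)| terms.iter().any(|t| t.as_str() == *key))
        .map(|(_, category)| *category)
}
```

With this dataset, predict(DATASET, "apple juice") yields Some("drink") because juic comes before appl, and a query made only of unknown words yields None.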

There's some more to it all, and there are downsides, but I won't go any deeper right now.

LLMs

It's 2026, so I can't not talk about it, can I?

Generative AI happened to the world right in between me first coming up with the idea of category prediction and having a chance to actually implement it. And I admit to having thoughts that maybe there's no point in building your own model for such a thing now. After all, just ask any LLM "which grocery category is dill weed" and it will tell you… a lot of text with several variants, which you can't really use in a precise manner :-)

So of course I went back to my own idea, because it's much, much simpler. And local. And free. And ethical.

Luckily, the simpler solution doesn't really lose on feeling magical and intelligent. I've seen people play with the app and really engage with it, and be impressed! One of the testers, when trying to come up with a random grocery item for the first time, said, "There's probably a million of them!" It doesn't matter that my entire model is just around 500 entries, it still feels like it knows much more simply because people overestimate the size of the problem :-)

Graphics

You see, I can process photos, I can do business graphics, and I'm known to have put together a few toolbar icons in my time… but for the life of me I can't draw! And even if I could, I'm particularly hopeless at coming up with what to draw.

So I commissioned the graphics from an artist, who also introduced me to the concept of "object shows" and the whole OSC fandom. Not sure I'm joining as a fan yet, but I'm definitely very happy with the original character of Shoppy! Oh, and the background.

And none of it is AI!

nfp -e

2026-05-05T15:03:51.634724+00:00

Last Friday I spotted Dave Gauer's post about using a text editor as a UI which hit some of my sweet spots about computers. One of the examples mentioned in it was crontab -e which opens a cron config file in a text editor. And not only does it spare you remembering the location of the config, but it also offers a guiding commented example if the config is missing, and helpfully signals cron to pick up the changes after you finish editing.

And almost immediately I thought about my own tool that could use something like this: nfp.

Amazingly enough, not only did I actually open the project and start fiddling around, I continued doing it through the weekend, and by the end of day on Sunday actually finished the feature! And despite it being a rather small one, as they go, I have to add that I was coding all through watching snooker matches, cooking food, chauffeuring my family on errands and dealing with some emergencies.

So I went to bed feeling quite happy with myself :-)

Complications

It still feels exciting to me how any programming task gradually reveals its true complexity after you go from thinking about what you should do to actually doing it. Saying "nfp -e should open the config in a text editor and restart after editing" sounds simple enough, but here's a few of the questions I had to work through. Some of them were quite the head-scratchers.

First, open which editor? There's an obvious $EDITOR env var, but there's also $VISUAL, which usually takes precedence.
How do you restart? Sending a signal to a working daemon was the first thing that came to my mind, but that might prove cumbersome, as the daemon lives in a loop waiting for file events, and this loop owns the information parsed from the config. Handling SIGHUP would require a separate facility to update that information. I'm not quite comfortable thinking of how to do that in Rust.

Thankfully, this complication turned out to be a blessing in disguise: since the file watching machinery is already there, just watch the config too and add a special case to handle it differently from regular files!
Do you edit the config file directly or do you do it on the side, in a temp file? The temp file feels like a cleaner, safer choice, because it gives you a chance to verify correctness of the new config and prevent the real one from breaking. But there's a downside: how do you open the same temp file for the user to continue editing it the next time they run nfp -e? It's going to be a new process, it doesn't know the old temp file.

crontab -e does it by organizing a loop within the same process with a yes/no prompt asking the user if they want to re-edit the same file. But that starts feeling more complicated than the feature deserves.

I ended up with a simpler solution where I always open the actual config, which means it can get mangled. I handle it in the running daemon itself, which simply refuses to restart its main loop when it can't parse the config.
Providing an example config on the first run proved to be tricky, as confy (the config handling library) actually immediately creates a non-empty file if it doesn't exist on a load attempt. So I had to rewire my brain to think "a config with no useful entries" instead of "a missing config." That worked!
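For the first question in that list, the editor lookup could be sketched roughly like this. It's an illustration rather than nfp's actual code: the function names are made up, and falling back to vi when neither variable is set is an assumption.

```rust
use std::env;

/// Picks the editor command: $VISUAL wins over $EDITOR, and empty values
/// are treated as unset. The "vi" fallback is an assumption for this sketch.
fn choose_editor(visual: Option<String>, editor: Option<String>) -> String {
    visual
        .filter(|v| !v.trim().is_empty())
        .or(editor.filter(|e| !e.trim().is_empty()))
        .unwrap_or_else(|| "vi".to_string())
}

/// Thin wrapper that reads the two variables from the real environment.
fn editor_from_env() -> String {
    choose_editor(env::var("VISUAL").ok(), env::var("EDITOR").ok())
}
```

Keeping the decision in a pure function separate from the environment lookup makes it easy to test without touching process-wide state.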

All in all, this was quite fun!

Some admin notes

I (finally) converted the repository from pijul to git and pushed it to CodeBerg. I still think pijul has a superior architecture as a VCS, but the world has apparently settled on git for good. Also, while I'm happy to not deal with the toxic culture of GitHub, having code published in a weird way means most people wouldn't even want to try it. After 4 years I haven't gotten a single peep of feedback :-) And I still believe in sharing. I hope CodeBerg becomes my sweet spot.

Keep in mind though, that I don't code in Rust regularly and don't keep up with modern idioms. So my code is most certainly very ripe for various improvements. I'd love to hear from you! (An obvious one is, the code needs tests. I just need to learn how to write them in Rust!)

Pet project restart

2025-11-23T06:24:21.059276+00:00

So what happened was, I had developed my shopping list to the point where it got useful to me, after which I lost interest in working on it. You know, the usual story… It was however causing me enough annoyances to still want to get back to it eventually. So a few weeks ago, after not having done any programming for a year, I finally broke through the dread of launching my IDE again and started slowly fixing the accumulated bitrot. And through the last several days I was having a blast implementing some really useful stuff and feeling the familiar thrill of being in the flow.

Annoyances

Since I was mostly focused on making the app useful I didn't pay a lot of attention to the UI, so most of the annoyances were caused purely by my not wanting to spend much time on fighting Android APIs.

Here's one of those. The app keeps several shopping lists in a swipe-able pager, and at the same time swiping is how you remove items from the list while going through the store. The problem was that swiping individual items was really sensitive to a precise finger movement, so instead it would often be intercepted by the pager and it would switch to the next list instead.

Very annoying…

That's fixed now (with an ugly hack).

But the biggest deficiency of the app was that it didn't let me get away from one particular grocery store that I started to rather dislike. You might find it weird that some app could exert such control over my actions, but let me explain. It all comes down to three missing features…

The central feature of my app is remembering the order in which I buy grocery items. This means I need a separate list for every store, as every one of them has a different physical layout. By the time I was thinking of switching to another store I already had an idea about a new evolution of the order training algorithm in the app, and a new store would be a great dogfooding use case for it. So I've got a sort of mental block: I didn't want to switch stores before I implemented this new algorithm.
Over some years of using the app with a single store I've been manually associating grocery categories with products ("dairy", "produce", etc.). They are color coded, which makes the list easier to scan visually. But starting a new list for another store meant that I would either need to do it all again for every single item, or accept looking at a dull, unhelpful gray list. What I really needed was some smart automatic prediction, but I didn't have it.
I usually collect items in a list over a week for an upcoming visit to the store, and sometimes I realize that I need something that it simply doesn't carry, or my other errands would make it easier to go to another store. At this point I'd like to select all the items in a filled-up list and move them to another, which the app also couldn't do.

See, it all makes sense!

Now, of course it wasn't a literal impossibility for me to go to other stores, and on occasion I did, it just wasn't very convenient. But these are all pretty major deficiencies, and I'm not ready to offer the app to other people without them sorted out.

Progress

Anyway… Over the course of three weeks I implemented two of those big features: category guessing and cross-store moves. And I convinced myself that I can live with the old ordering algorithm for a while. So now I can finally wean myself off of the QFC on Redmond Way (which keeps getting worse, by the way) and start going to another QFC (a completely different experience).

Freedom!

Category guessing

All the categories (item colors) you see in the screencaps above were guessed automatically. My prediction model works pretty well on my catalog of 400+ grocery items: the data comes from me tagging them manually while doing my own shopping these past 4 years. And this also means, of course, that it's biased towards what I tend to buy. It doesn't know much about alcohol or frozen ready-to-eat foods, for example. I'm planning to put up a little web app to let other people help me train it further. I'll keep y'all posted!

One important note though…

No, it's not a frigging LLM!

It's technically not even ML, as there is no automatic calibration of weights in a matrix or anything. Instead it's built on a funny little trick I learned at Shutterstock while working on a search suggest widget. I'll tell you more when I launch the web app.

Android UI

When I started developing the app, I used the official UI toolkit documented on developer.android.com. It's a bunch of APIs with a feel of a traditional desktop GUI paradigm (made insanely complicated by Google "gurus"). Then the reactive UI revolution happened, and if you wanted something native for Android, it was represented by Flutter. Now they're recommending Compose. I'm sure both are much better than the legacy APIs, but I'm kind of happy I wasn't looking in this space for a few years and wasn't tempted to rewrite half the code. Working in the industry made me very averse to constant framework churn.

Future plans

I'm not making any promises, but as the app is taking shape rather nicely, I'm again entertaining the idea of actually… uhm… finishing it. Which would mean beta testing, commissioning professional artwork and finally selling the final product.

Wish me luck!

Debounce

2023-12-24T23:56:03.986303+00:00

When last time I lamented the unoriginality of my tool nfp I also entertained the idea of salvaging some value from it by extracting the event debouncing part into a stand-alone tool. So that's what I did.

Meet debounce, a Rust library and a prototype command-line tool.

Rust

In all honesty, I didn't really need debounce as a library. nfp was already working fine. But it felt like the Right Thing™ to do, and gave me an excellent opportunity to play with Rust's synchronization primitives.

Blocking and waiting

For a clean solution (that is, one without busy polling I employed before) I needed two threads: the main one to block and wait forever for external events, and a worker to wait out timeouts and perform specified actions. The worker would also have a mode where there are no events and it should block and wait for the main thread to supply one.

This sounds like the job for a condition variable, and I had hoped Rust would have some idiomatic higher-level wrapper around them. Turned out, Rust had three :-)

Condvar, which is exactly the Rusty wrapper around the idea of a condition variable
channel, a higher-level interface for consumers to wait on data supplied by producers (which I suspect is built on top of Condvar)
parking, a built-in lightweight ability for a thread to suspend ("park") until the other thread wakes it up.

Somehow it's very on-brand for a language that gives you 5 kinds of pointers and 4 kinds of strings :-) But this is also what makes it fun! Anyway, I ended up using parking, as it didn't need any extra code and worked well for my use case where I don't mind the worker thread being occasionally randomly woken up out of turn.
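As a rough illustration of the parking approach (not the actual debounce code), here's a tiny worker that parks until the main thread hands it an event. The wait_for_event function and the ready flag are made up for this sketch. Since park() is allowed to return spuriously, the worker re-checks the flag in a loop, which is exactly why occasional random wake-ups are harmless.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns a worker that sleeps until the main thread signals an event,
/// then returns whatever the worker produced.
fn wait_for_event() -> &'static str {
    let ready = Arc::new(AtomicBool::new(false));
    let worker_ready = Arc::clone(&ready);

    let worker = thread::spawn(move || {
        // park() may wake up out of turn, so the condition is re-checked
        // every time before the worker proceeds.
        while !worker_ready.load(Ordering::Acquire) {
            thread::park();
        }
        "event handled"
    });

    // Main thread: publish the event first, then wake the worker.
    // If unpark() happens before the worker parks, the wake-up isn't lost:
    // the next park() call returns immediately.
    ready.store(true, Ordering::Release);
    worker.thread().unpark();

    worker.join().expect("worker thread panicked")
}
```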

Traits intricacies

Another purely Rustian puzzle I stumbled upon had to do with polymorphism. I have two kinds of event buffer types with an identical interface for getting values out of them. In Rust you express this with a trait which concrete types then implement in their own way:

pub trait Get<T> {
    fn get(&mut self) -> State<T>;
}

T is the type of data stored in the buffer.

Here seasoned Rustians are probably already asking their screens something along the lines of "wait, if your return type strictly depends on what's in the buffer, it doesn't really make sense for it to be a parameter of the trait…" And they are totally correct, but I didn't know that at that point.

So far so good. Then I thought that, having a .get(), it should be pretty natural for the buffer to implement a standard Iterator that would call .get() as long as there are items in the buffer in the ready state.

So I wrote the obvious:

impl<T, B> Iterator for B
where
    B: Get<T>,
{
    fn next(&mut self) -> Option<T> {
        todo!();
    }
}

Which says "this is an implementation of the standard Iterator trait for any type B which implements Get".

This however produced a compiler error which proved too intricate for me to understand. So, long story short, I went on the Rust user forum where nice people spent a couple of days imparting to me deep knowledge about traits, blanket implementations and associated types (which I think I finally get). Now my buffers are also iterators and I don't need to repeatedly call .get() in my tests :-)
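For the curious, here's one way the pieces can fit together. It isn't necessarily how debounce ended up doing it. In this sketch the data type becomes an associated type of the trait, and iteration goes through a small adapter struct, because implementing the foreign Iterator trait for every B at once isn't allowed. The State variants, the Ready adapter and the Buffer type are all hypothetical stand-ins.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the library's State; the real one may differ.
pub enum State<T> {
    Ready(T),
    Empty,
}

pub trait Get {
    /// The stored data type is tied to the buffer type itself,
    /// so it's an associated type rather than a trait parameter.
    type Data;

    fn get(&mut self) -> State<Self::Data>;

    /// Borrows the buffer as an iterator over items in the ready state.
    fn ready<'a>(&'a mut self) -> Ready<'a, Self>
    where
        Self: Sized,
    {
        Ready(self)
    }
}

/// Local adapter type: implementing Iterator for it is allowed, unlike
/// a blanket implementation for every type that implements Get.
pub struct Ready<'a, B: Get>(&'a mut B);

impl<'a, B: Get> Iterator for Ready<'a, B> {
    type Item = B::Data;

    fn next(&mut self) -> Option<B::Data> {
        match self.0.get() {
            State::Ready(value) => Some(value),
            State::Empty => None,
        }
    }
}

/// A toy buffer where every stored item is immediately ready.
pub struct Buffer(VecDeque<u32>);

impl Get for Buffer {
    type Data = u32;

    fn get(&mut self) -> State<u32> {
        match self.0.pop_front() {
            Some(value) => State::Ready(value),
            None => State::Empty,
        }
    }
}
```

With this in place, buffer.ready().collect::<Vec<_>>() drains everything that's ready without calling .get() by hand.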

Here's a couple of things I had a chance to reflect on, following this story:

Having such a go-to place as users.rust-lang.org is exactly what I'm missing while developing for Android. To my knowledge, there just isn't anything like this for that ecosystem, and everyone just shouts into the abyss of Stack Overflow and tries to sort out random pieces of code coming from there.
This type system wrangling is one of the things that makes dynamically typed languages more productive. And yes, I'm aware of the downsides, so no need to repeat the mantra of "typed languages remove a whole class of bugs" in the comments. Better think of the whole new class of code structures you need to learn and maintain to do it :-)

CLI tool

Rust's packaging tool, Cargo, has a built-in notion of "examples", where you can implement something working without affecting your library dependencies and have it automatically built alongside the main code.

So I implemented a CLI tool which works exactly in the way I described in the previous post, by removing sequential duplicates from stdin that happen within a specified grace period:

inotifywait -m . | debounce -t 200

It's very bare-bones, as much as you would expect from a working example. I encourage anyone who needs additional options and features to write their own solution. (Here's a free idea: let the user specify by which part of the string to test equality, either with a regex or a field index, or something.)
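The core idea can be sketched as a pure function over timestamped lines. This is a simplification and not the actual tool: it assumes a line is dropped when it equals the previous line and arrived within the grace period of it, whereas the real debounce may time things differently (for example, by waiting out the period before emitting). The dedup_within name and the event representation are made up for illustration.

```rust
use std::time::Duration;

/// Drops a line when it repeats the previous line within the grace period.
/// Each event is (time since start, line), assumed to be in arrival order.
fn dedup_within(events: &[(Duration, &str)], grace: Duration) -> Vec<String> {
    let mut out = Vec::new();
    let mut last: Option<(Duration, &str)> = None;
    for &(at, line) in events {
        let duplicate = match last {
            Some((prev_at, prev_line)) => {
                prev_line == line
                    && at
                        .checked_sub(prev_at)
                        .map_or(false, |gap| gap <= grace)
            }
            None => false,
        };
        if !duplicate {
            out.push(line.to_string());
        }
        // Remember the most recent line even when it was dropped, so a
        // steady stream of duplicates keeps being suppressed.
        last = Some((at, line));
    }
    out
}
```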

Open source

So this technically makes me an open source maintainer. Again. But this time, having 10+ years of experience maintaining highlight.js I think I'm going to do things differently.

I don't like the default assumptions about what FLOSS maintainers are supposed to do these days. You're supposed to write code, do regular releases (lest your project be pronounced dead), react to issues, review PRs from random people and be extra energized when dealing with anything that has the word "security" attached to it. And as a bonus for particularly good work you'll be rewarded with a Community™, whose self-proclaimed leaders would harass you for being a dictator who should feel guilty about not having the Community's interests, as formulated by the "leaders", in mind every single second of your life.

This is all bullshit, of course. But this is also reality. And I used to bitch about it before, there's nothing new here.

So here's what I'm going to do:

I won't develop the code past what I need from it myself. If someone needs more features, they should write their own solution and maintain it (or not maintain it!) in the way they want. The license explicitly allows it.
I am interested in what other people would make of it, but I make no promise about accepting all derivative work into my code. As long as you don't forcefully insist on having your PR merged, I remain a nice person and encourage sharing of ideas!
I am especially interested in suggestions (in any form) on improving my Rust. This is, after all, what I wrote the thing for!

My choice of pijul as a version control system plays well into this, as I expect to be somewhat shielded from GitHub's crowd where that sense of needy entitlement is especially strong.

A random recent example is this thread where people with Opinions™ have been harassing maintainers of black about a minor issue for three years, and not a single one of them thought of volunteering to maintain a fork with the stability guarantees they ostensibly require so hard. Such work is not much fun of course, but they assume the maintainers owe it to them.

P.S. I think I should write more about pijul, it's an interesting project!

P.P.S. By the way, check out highlight.js! Since I transferred it to more motivated people it became such a powerhouse!

New pet project

2025-11-03T00:07:17.013132+00:00

So anyway, I'm making a shopping list app for Android. As I understand, "shopping list" is something of a hello-world exercise of Android development, which may explain why there are so many rudimentary ones in Google Play. Only in my case I actually need one, and I know exactly what I want from it.

See, for the past 10 years or so I've been in charge of food supply in our family, which includes everything from grocery shopping logistics, to cooking, to arranging dishes in the dishwasher. And the app is an essential part of the first stage of that chain.

Previously

Up until recently I used Out of Milk, which someone suggested to me a long time ago, and at that time it was probably the best choice. I remember being quite happy to pay for a full version. Over time though it got a little bloated in ways I didn't need and a little neglected in places I cared about. The UI got very "traditional", requiring fiddly unnecessary motions for core functionality.

Here's the short list of its wrongs that I still remember:

Start-up time of several seconds, sometimes overflowing into dozens. I believe my 4-year-old phone should be perfectly able to load a shopping list in sub-second time.
Adding an item when it's already on the list results in two identical items on the list. (Yes, really.)
Auto suggest when adding an item has whatever ordering and limits the amount of displayed results. This meant I could never get "Tomatoes" in there, as they were buried under "Roma tomatoes", "Cherry tomatoes", and a few others with no way to scroll to it.
Tiny click target to check an item off the list. I was constantly fat-fingering around those and getting into a different screen.
Checking an item off the list puts it into another list below the main one, which you either have to empty all the time, or end up with a huge scroll height. As I understand, the idea was that you could uncheck the items from there to put them back on the list, but that's unrealistic with my catalog of ~ 150 items.
"Smart" categorization kept inventing excessively detailed categories leading to several one-item categories clogging up the list.
Sometimes unsuccessful synchronization would "forget" added items on the list. Which is funny because I didn't have anything to synchronize with!

Now what

I probably could spend some time on searching for an app that'd suit me better, but… Look, I'm a programmer. Writing code is what I do! And I wanted to play with Android development since forever, and the recent exposure to Kotlin gave me all the reasons I didn't really need in the first place :-)

Here's a laundry list of what I want from a shopping list:

Automatic ordering based on the order in which I buy things. I've had this idea ever since I was using Out Of Milk, because ordering manually sucks, and it feels like something computers should be able to do well, right? However it's really not trivial to implement, if you think of it. So it was my main challenge and a trigger to actually start the project.
Fuzzy search for suggested items. I'm used to typing 3-4 characters in my Sublime Text to go to every file or identifier in a project. I want the same service here.
Smart sorting of suggested items. It could take into account closeness of matching, frequency and recency of buying.
Multiple lists with separate histories. Different stores have different order of aisles, and I buy different things in them. A single list won't cut it.
Renaming and annotating items. I get annoyed by typos and spelling errors, I want to correct them. And sometimes I want to add a short note to an item (like a particular brand of cheese, or a reminder that I need two cartons of milk this time).
Color-coded categories, to give visual aid in scanning what otherwise would be a plain list of strings. They don't have to be terribly detailed.
Fewer buttons, check boxes and dialogs. I want to interact with the content itself as much as possible. Swiping items off the list instead of clicking a checkbox. Having lists themselves in a carousel, instead of choosing their names from a <select>, etc. Oh, and no settings, if I can get away with it!
Undo. It's really annoying to accidentally swipe off something covered by your thumb only to realize it's not what you intended, and now you have no clue what it was.
GPS pinning. This is one aspirational feature I'll probably tackle last, if ever. I want to pin a list to a particular geo location, so the app would automatically select it when I'm at this store again.
Also, no tracking, ads or other such bullshit. Should be self-explanatory :-) Not having some ugly API SDK making network calls at startup should really help with performance.

Current status

I actually first started working on it at the end of 2019 and made good progress into 2020… but then something got in the way.

commit 8ca7b341801db3fda2e6fdbb5c1436d2b917b123
Author: Ivan Sagalaev <maniac@softwaremaniacs.org>
Date:   Fri Dec 4 20:43:11 2020 -0800

    Remove .idea/* from under git

commit a7d58b20d051b47cdc79868f578c85ba831c4801
Author: Ivan Sagalaev <maniac@softwaremaniacs.org>
Date:   Sun Jan 26 22:21:46 2020 -0800

    Rename `actualRecency` -> `recencyScore`

Yeah… Anyway, after making an effort to restart the project I'm making good progress again and actually feel really happy about it all!

Swiping right to "Buy" a thing

About a month ago I started dogfooding the app and was able to delete Out Of Milk from my phone. (So long and thanks for all the fish!) I've got the first five features mostly done, but there's nothing like actually using it that keeps showing me various edge cases I could never think about. I love this process :-)

Crucially, I can now add "Tomatoes" by just typing "t", "m" — and have them as the first suggestion.
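To give a feel for how that kind of matching can work, here's a minimal subsequence-based sketch. It's written in Rust for illustration and isn't the app's actual code or scoring: the function names are made up, and ranking by where the match starts and then by name length is an assumption. The frequency and recency signals mentioned earlier are left out.

```rust
/// Returns the position where the match starts if every character of the
/// query appears in the candidate in order (case-insensitive), else None.
fn match_start(query: &str, candidate: &str) -> Option<usize> {
    let mut wanted = query.chars().flat_map(char::to_lowercase).peekable();
    let mut start = None;
    for (i, c) in candidate.chars().flat_map(char::to_lowercase).enumerate() {
        if wanted.peek() == Some(&c) {
            if start.is_none() {
                start = Some(i);
            }
            wanted.next();
        }
    }
    if wanted.peek().is_none() {
        Some(start.unwrap_or(0))
    } else {
        None
    }
}

/// Filters the catalog down to matching names and puts the ones matching
/// closest to the beginning (and the shorter ones) first.
fn suggest<'a>(query: &str, catalog: &[&'a str]) -> Vec<&'a str> {
    let mut hits: Vec<(usize, usize, &'a str)> = catalog
        .iter()
        .filter_map(|&name| match_start(query, name).map(|pos| (pos, name.len(), name)))
        .collect();
    hits.sort();
    hits.into_iter().map(|(_, _, name)| name).collect()
}
```

With a catalog of "Roma tomatoes", "Milk", "Cherry tomatoes" and "Tomatoes", the query "tm" drops "Milk" and ranks "Tomatoes" first.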

The app looks pretty rudimentary, as you'd expect at this stage. But really, this time I want to not just fool around and dump the code somewhere in the open, I actually want to make a finished, sellable product out of it. Going to be a fun adventure! (Technically, me and my wife already tried selling my shareware tools at some time in the previous century, but we managed to only sell about two copies, so it doesn't count.)

Wish me luck :-)

Misconception about OSS support

2019-06-03T22:33:08.462329+00:00

You wouldn't think a free syntax highlighting library would be a strong dependency for the development process of a business, and yet I'm waking up on a Monday to a flurry of comments and even one personal email from engineers eager to ask me to work for free for their employers.

So of course I took time to scathingly turn it into a teachable moment.

https://github.com/highlightjs/highlight.js/issues/1984#issuecomment-466941892:

I would like if you revert the change. It is currently blocking a lot of build from other people

Let me take this as an opportunity to explain something about the current sorry state of relationship between businesses and open source projects. (Yeah, I know, but people still don't get it.)

highlight.js is not a business, it's a hobby.

It means that whatever gets pushed to this repository or npm should be assumed to be the result of someone having fooled around and gone away for a weekend with their family. Or for a busy working day at their job.

If a business has made a decision to rely on this artifact for anything requiring any sort of stability (i.e. "blocking a lot of build from other people"), it made a stupid and uninformed decision. Or more realistically, it simply relies on maintainers feeling ashamed enough to quickly fix problems when they happen. Even more realistically, it just accepts the fact that their engineers are going to deal with maintainers by soliciting free support, because it has always worked this way. I, for one, don't feel any urge at all supporting someone's misplaced expectations :-)

So, dear fellow engineers, please take this build hiccup as an opportunity to explain to your particular business people that their entire intellectual property is a thin layer on top of a shaky foundation of open-source code lazily maintained by hobbyists or paid for by other businesses having their own goals in mind. Mention the leftpad story for more effect.

If they really want stability they have to invest in it.

… by, for example, hiring engineers to deal with myriad of dependencies, maintain local stable forks, contribute patches upstream, or whatever — the key point is that it should not look like it "just works" on fairy dust.

highlight.js turns 10

2016-08-17T06:27:06.934000+00:00

Almost exactly ten years ago on August 14 I wrote on this very blog (albeit in a different language):

So on yesterday's night I got worked up and decided to try and write [it]. But on a condition of not dragging it on for many days if it didn't work out on the first take, I've got enough on my mind as it is.

It did work out. Which makes August 14 the official birthday of highlight.js! Although it wasn't until 5 days later when the first meaningful commit was recorded. Using any form of source control was only an afterthought for me back then :-)

Quick flash back

Switched through 3 version control systems (Subversion, Bazaar, Git).
Made 71 (seventy-one!) public releases, with a regular 6 week cadence for the past year.
166 languages and 77 styles created by 216 contributors and 3 core developers.
Accumulated 8062 stars on GitHub.
Went from being a single .js file to be provided as a custom-built package, a node.js library and served from two independent CDNs.
Acquired a mighty 490-strong unit test suite.

Identity

With the obligatory self-congratulatory stuff out of the way, let me now get to the main purpose of this anniversary post: explaining what makes highlight.js different among other highlighters. I'm not going to talk about obvious features listed on the front page of highlightjs.org. I'll try to document the philosophy that up until this point I was only referring to in various places, but never was able to put together.

I'll try to keep it short (otherwise I'll never finish this post!)

It is my deep conviction that highlighting should make code more readable instead of simply making it… fun, for lack of a better word.

Let me explain by example. Here's some things that serve towards better readability when highlighted:

Keywords, because they define the overall structure of the code and because they need prominent highlighting simply because they otherwise look too much like user variables.
Function and class titles at the place of declaration, because they effectively define a domain-specific language, an API. They have a very distinct semantics.
Built-ins and special literals, because it helps to know what in the code belongs to the language and what is defined by the user.

And these are the things highlighting which makes no sense, in my humblest opinion:

CamelCase identifiers, because it's not consistent: you get identifiers of the same nature either highlighted or not simply because they happen to be named differently.
.method() calls, because I, frankly, can't even invent a plausible reason of why they should be highlighted in any way.
Punctuation, because it significantly increases the amount of color clutter in any given snippet which makes it hard on the eyes.

I have a hypothesis that the only reason why these things get highlighted traditionally is simply due to the fact that they could easily be picked up by a regexp :-)

In highlight.js we sometimes go to great lengths to highlight what makes sense instead of what's easy ("semantics highlighting?"). In lisps we highlight the first thing in parentheses, regardless of it being or not being built-in, and we have special rules to not highlight them in quoted lists and even in argument lists in lambdas in Scheme. In VimScript we try our best to distinguish between strings and line comments even though they seem to be deliberately designed to trip up parsers. And we recognize quite a few ways of spelling out attributes in HTML.

The downside of this is that highlight.js is heavier and probably slower than it could've been. These were the reasons why we recently lost a bid on replacing the incumbent highlighting library on Stack Overflow. I still think they made a mistake :-)

Because quality beats lightness!

Come join us!

Of course no code base is ideal, especially a 10 year old one, there's always so much to do! However, since our way of dealing with the stress of Open Source maintenance is to not have it happening to us, the development of highlight.js goes at a rather leisurely pace. Which means we've accumulated quite a few plans without any reasonable expectation of when they might happen.

There's a new exciting parser in the making. We'd like to do an overhaul of our build system and packaging. There are plans to have pluggable renderers in addition to HTML.

You could be the one taking one of those over and covering yourself with great glory! If interested, drop me a line at maniac@softwaremaniacs.org.

Cadence for highlight.js

2015-09-09T19:22:55.312000+00:00

We're now doing releases of highlight.js on a cadence of 6 weeks. The latest release 8.8 was the second in a row (which is what technically allows me to write "are now doing").

The reason for that is we (well, mostly me) had a certain difficulty deciding when to actually release something. We don't develop new grand features on a regular basis, all that's happening is bug fixes, new language definitions and new styles. And releasing a new version for every little change is going to annoy end users and drive downstream maintainers mad. So releases tended to happen pretty much by chance. Like someone would ask on a random GitHub issue when is the next release and I would think, why not right now?

This anarchic approach actually worked for some time while the project wasn't going too fast. But as this has changed in the last couple of years, and as I've left users stranded waiting for a new release for months on a couple of occasions, I thought it was time to get more serious.

Our release process is now quite simple, too. A maintainer only has to document the changes, update the version number and push it all to GitHub. GitHub then pings a certain API handler on highlightjs.org and the site does everything else:

updates the code,
builds a CDN package and pushes it to GitHub from where two independent CDN providers pick it up, also automatically,
builds and pushes a package to npmjs.org,
updates the live demo and various metadata (version number, language count, etc),
pre-builds site's caches used for dynamic custom builds,
publishes version-related news from the CHANGES file,
restarts itself,
goes on social media and spends a day generating an over-excited buzz about the release (OK, probably not this :-) ).
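
The chain of steps above could be modelled as an ordered list of functions that run one after another and stop at the first failure. This is only a rough sketch: the step names and the `runRelease` helper are invented for illustration and say nothing about how the site actually implements it.

```javascript
// Hypothetical sketch: every name here is invented for illustration.
// Each step receives the release context and may throw to abort the release.
const steps = [
  ['update code', (ctx) => { ctx.log.push('code@' + ctx.version); }],
  ['build CDN package', (ctx) => { ctx.log.push('cdn'); }],
  ['publish npm package', (ctx) => { ctx.log.push('npm'); }],
  ['update demo and metadata', (ctx) => { ctx.log.push('demo'); }],
];

function runRelease(version, pipeline) {
  const ctx = { version, log: [] };
  for (const [name, step] of pipeline) {
    try {
      step(ctx);
    } catch (err) {
      // Stop at the first failing step so later steps never run on a broken state.
      return { ok: false, failedAt: name, log: ctx.log };
    }
  }
  return { ok: true, failedAt: null, log: ctx.log };
}

const result = runRelease('8.8', steps);
console.log(result.ok, result.log.join(', '));
```

Reporting which step failed is the useful part: with a process that is admittedly still fragile, a maintainer wants to know where to pick things up by hand.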

The process is still fragile but bugs are getting fixed and it's anyway immensely simpler than doing it all manually.

See you next on October 20th!

Styles unification: first results

2015-05-06T21:37:53.791000+00:00

Yesterday I gathered some willpower and began working on a long awaited (by myself, at the least) style unification in highlight.js. Here's the first taste of why I think it is important.

Let's take one of the recently added styles, "Android Studio", and see how it displays two config languages that happen to not count as "hot" these days: Apache and .Ini:

Section headers, variable expansions, rewrite flags aren't highlighted at all.
Pre-defined literals ("True", "on") are highlighted in .Ini using the same color as directive names in Apache.

To fix this particular case I had to define semantics for classes "section", "meta", "variable", "name" and "literal", and drop all the Apache- and .Ini-specific rules from styles.

Here's how it looks now, nice and consistent:

There's a looong road ahead but after it's done designing a new style will be a matter of using a relatively short list of well-documented classes with a good guarantee that all languages will look decent.

I learned C# in 4 days!

2018-10-23T16:34:10.711863+00:00

You know those crazy books, "Learn whatever programming in 21 days"? I mean, who can afford spending that much time, right?

Some background

I have a friend who employs a very particular workflow for dealing with his digital photos. It often involves renaming and merging files from different cameras into a single chronologically ordered event, relying on natural sorting of file names in Windows Explorer. File names are constructed of picture time fields and running counters, like "2015-02-06_001.jpg".

This is of course too tedious to do by hand, so he was very happy with a small specialized Windows utility that I wrote for him a few years ago when Windows XP ruled the world and I still programmed in Delphi. The program worked fine until, with the natural flow of time, the world switched to Unicode and newer Windows started to display question marks in place of Cyrillic characters in the program's UI. This made it rather unusable. There were also other small and not so small imperfections about the program that, as I understand, added a considerable factor of irritation to the act of processing photos. ("And when it happens upon a panoramic shot you may as well go and pour yourself some coffee because UI is frozen for minutes while loading the preview…")

So a year ago, when we were visiting his family for Christmas, he nagged me, politely but emphatically, about at least making the UI readable again and also, just maybe, fixing some of the most outrageous annoyances uncovered over the years of usage. The only problem was… I'd lost the source code! I know, it might sound utterly unbelievable these days but it was written in the era before GitHub, and back in those days I was using (wait for it) Zip drives to store my backups. Which in hindsight turned out to be suboptimal: they fail.

All this, however, provided me with a unique opportunity for making a really good Christmas gift this year…

I suppose there exist people out there who could come up instantly with a perfect gift idea for any of their dozens of friends upon being woken up in the middle of the day, but most of us seem to be destined to endure the agony of scratching the bottom of the void bowl of "what on Earth should we give them this time that won't suck like the last time!" So I was pretty much stoked when some weeks before we were about to leave for the trip it hit me that I actually could write the same program from scratch!

And I'm happy to say that ultimately the idea did work out as intended and at some point it has even been uttered that it was "the best gift ever!"

The best thing though is that now I can actually maintain the code (which I'm doing once a week these days) and not feel sorry for writing another half-working utility. Software is a process, after all.

The endeavor

So I had to learn how to write Windows GUI apps, again. Going back to Delphi was pretty much out of the question as even back then it was already losing mind share to the quickly rising C# and I simply assumed that by now this process had completed. Besides, I actually wanted to learn how Windows GUI programming is "officially" done these days. (Notwithstanding the fact that we're still talking about traditional desktop software, not Metro tiles.)

The lazy evaluation phase took me a couple of weeks, during which I only figured out which of the three-letter acronyms I need to know: WPF, MVVM, C#. The actual design and implementation with ongoing research took 4 days — literally. The most helpful resources along the way were WPF Tutorial and Stack Overflow (of course).

Most importantly though, it was rigorous planning and doing design ahead of coding that allowed me to get the thing done. Here's a few snapshots of my whiteboard with the UI mock-up and current tasks divided by priority:

And though this entire article is not of particular practical importance — I'm simply sharing my emotions here — there is one point I'd really like to drive home:

Planning works. Always.

If you're one of those who doesn't "believe" in it, and for whom "plans never work", I say you most certainly are just doing it wrong and fixing it is a matter of learning how. Indulge yourself.

C# and WPF

I'll say from the get-go that I can't presume to have an accurate opinion about a mainstream language after spending just 4 days with it. This is only my first impression.

It feels to me like a modern Delphi, which is probably not surprising given that both were invented by the same Anders Hejlsberg. Type inference makes static typing a lot more palatable, however the time spent on satisfying the compiler's complaints about inconsistent types still feels to me like the time lost. I was pleasantly surprised though by some nice things making their way into a 10+ year old language: lambdas, += for registering event listeners, LINQ — this is all very handy.

But overall, for a Pythonista, the language still feels way too verbose and ceremonious. Want to display a regular public attribute in UI? Oh, just turn it into a property with a getter and a setter and an accompanying separate private field of the same type. A dozen or so lines of code to satisfy a convention — not cool.

Likewise, I can't compare WPF to any modern UI framework as I didn't use any (which is a shame, really). From this position, what immediately feels right about WPF is the data binding concept. Instead of writing disjoint pieces of imperative code updating disjoint pieces of UI and trying to do it in the right order and not forgetting anything, you now define relations like "this ListView shows this list from my data model" and "this action is enabled when these conditions are met and it is bound to these UI controls". And all the controls' state is updated pretty much automatically. I believe it's that thing they call "reactive programming" these days…

The GUI editor is unusable. It took me probably only half a day before I completely switched to editing XAML by hand, and as I understand it's how it's done in practice. Here's a simple example why the editor sucks. XAML layout works best by dividing your window into panels, some of which are of fixed size while others automatically fill available space. Only the GUI editor doesn't do that, instead it gives all panels fixed sizes in pixels, thus defeating the purpose completely. So, surprisingly the old Delphi GUI editor remains the best in my limited opinion: it was usable and it did the right things by default most of the time.

The code

I didn't publish it anywhere yet but I will once I figure out SSH keys on Windows and choose proper licensing. I'm very interested in a code review from someone versed in WPF/C# but what I don't want to do though is maintain it as a proper project with contribution and such, it's just too much hassle.

ijson 2.0

2014-10-13T06:37:27.377000+00:00

Yesterday I released version 2.0 of the streaming JSON parser ijson. It mostly includes bug fixes accumulated over the last year and the only reason to change the major part of the version number was that import ijson doesn't do any discovery magic anymore.

Import

Previously, when you did import ijson it used to first go on a trial-and-error search for the latest version of the C library yajl and, if none was found, used the Python backend as a fallback. This approach proved to be buggy and unpredictable: simply moving your app into another environment might have introduced different behavior, like being significantly slower on a machine without yajl or exposing bugs present in one backend but not the other.

So, following the "explicit is better than implicit" commandment I dropped the discovery, so import ijson now always loads the safe pure Python backend. You can explicitly import any of them with import ijson.backends.<name> as ijson.

You might argue that import ijson is still not explicit enough but I didn't want to force users to always use a full backend name. Because "practicality beats purity".

Other changes

Fixed breakage when a multi-byte UTF-8 character was split by a buffer boundary.
Python backend now accepts custom buffer size as an argument.
Always return integer values as 'type int' even if spelled like 1.0 or 1E2 in JSON.
Use Wheels for a distribution format.
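
The buffer boundary bug from the first item is a general hazard of streaming decoders, not something specific to ijson. Here's a minimal illustration of it in JavaScript (not ijson's Python code, and not how ijson fixed it): the standard `TextDecoder` either mangles a split character or handles it, depending on whether it's told that more bytes are coming. The byte values and chunk boundaries are made up for the example.

```javascript
// The JSON text "é" in UTF-8: the letter é takes two bytes, 0xC3 0xA9.
const chunk1 = Uint8Array.from([0x22, 0xc3]); // buffer ends mid-character
const chunk2 = Uint8Array.from([0xa9, 0x22]);

// Decoding each buffer on its own breaks the character in two.
const naive =
  new TextDecoder('utf-8').decode(chunk1) +
  new TextDecoder('utf-8').decode(chunk2);

// A streaming decoder keeps the incomplete byte until the next buffer arrives.
const decoder = new TextDecoder('utf-8');
const streamed =
  decoder.decode(chunk1, { stream: true }) +
  decoder.decode(chunk2, { stream: true }) +
  decoder.decode(); // flush whatever is left

console.log(naive === '"é"');    // false: replacement characters crept in
console.log(streamed === '"é"'); // true
```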

Also the lexer is now reimplemented as a generator and simplified a little bit, it's now only 46 lines of code. Funny thing, though: this change made it slightly faster on CPython but slightly slower on PyPy. Looks like PyPy really likes objects and doesn't mind all the self.something references and myriads of method calls. Go figure :-).

highlight.js: what's next

2015-08-30T19:39:17.611000+00:00

This is a loosely ordered dump of ideas about the future of highlight.js presented for purposes of information and discussion. The project is already big enough that the best I can do for it is not writing code but trying to get people interested in joining in. Let's see if I can show you that this project is not just about herding a bunch of regexes :-)

Testing

Our current "test suite" is well past being adequate. It started its life as a demo page that accidentally assumed along the way some rudimentary testing responsibilities. A good demo is short, neat and beautiful while a good test should be comprehensive. Our suite is unfortunately neither: it's big and ugly and at the same time it doesn't actually evaluate tests, relying instead on a human to notice that something is wrong with those few dozens of languages.

So before we go any further we need a test suite that would:

test language detection on small and non-obvious fragments
compare produced markup against control samples with all supported language features
perform special tests for different settings and features of the library

I'd say it's a nice big project in its own right!

Class name unification

One of the early design principles for highlight.js was having just a few common class names in order to have universal styles that would work for any language. Unfortunately this principle wasn't strongly enforced. We now have language-specific classes, language-specific style rules and whole language-specific styles. This is a maintenance nightmare as the number of unique conditions that should be visually tested is the product of the number of unique language features and the number of styles. And since this is an impossible amount of work, we usually test only a small subset of languages and styles and rely on pure luck for the rest, which is the majority.

One way to deal with that is to confine class names to a very generic fixed set and force languages to use only that. Apart from reducing the amount of mess it will also enable an interesting side feature — automatically generated styles. If the semantics of class names is fixed we could intelligently group them and assign to those groups a few distinct font/color combinations provided by a user.
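
To make the generated styles idea more tangible, here's a minimal sketch. Everything in it is an assumption made for the sake of the example: the grouping of class names, the `hljs-` prefix and the `generateStyle` function are not an existing feature of the library.

```javascript
// Hypothetical grouping of a fixed set of class names by meaning.
const GROUPS = {
  structure: ['keyword', 'section', 'name'],
  data: ['string', 'number', 'literal'],
  aside: ['comment', 'meta'],
};

// Turn a user-provided palette (one entry per group) into CSS text.
function generateStyle(palette, prefix = 'hljs-') {
  return Object.entries(GROUPS)
    .filter(([group]) => palette[group] !== undefined)
    .map(([group, classes]) => {
      const selector = classes.map((c) => '.' + prefix + c).join(', ');
      const { color, bold } = palette[group];
      const weight = bold ? ' font-weight: bold;' : '';
      return `${selector} { color: ${color};${weight} }`;
    })
    .join('\n');
}

const css = generateStyle({
  structure: { color: '#008', bold: true },
  data: { color: '#800' },
  aside: { color: '#888' },
});
console.log(css);
```

The user only picks three font/color combinations, and every language that sticks to the fixed class names gets a consistent look for free.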

And this is certainly going to be completely backwards incompatible.

De-specializing keywords

Currently we parse keywords differently than the rest of language features. They use their own completely independent parsing pass. This gives us speed (which was the main reason for introducing it) and a neat definition syntax at the price of code size and complexity. The speed advantage is largely irrelevant by now, as browsers became much faster than six years ago. So it is a good time to throw that special code away.

The syntax will change, so instead of this:

{
  keywords: 'if for while ... ',
  contains: [
    STRINGS,
    NUMBERS
  ]
}

We're going to have something like this:

{
  contains: [
    {
      className: 'keyword',
      beginWords: 'if for while ... ',
    },
    STRINGS,
    NUMBERS
  ]
}

It's not by any means final, it just shows the idea that I want keywords to become a regular parsing mode.
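
One way a regular mode could recognize a word list is by compiling it into a single regex with word boundaries. This is a guess at a possible implementation, not the actual one: `beginWords` is only the tentative property from the example above, and `compileBeginWords` is a made-up helper.

```javascript
// Escape characters that have a special meaning inside a regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: turn 'if for while' into /\b(?:if|for|while)\b/.
function compileBeginWords(words) {
  const alternatives = words
    .split(/\s+/)
    .filter(Boolean)
    .map(escapeRegex);
  return new RegExp('\\b(?:' + alternatives.join('|') + ')\\b');
}

const keyword = compileBeginWords('if for while');
console.log(keyword.test('while (x) {}')); // true
console.log(keyword.test('format(x)'));    // false: "for" is only part of a word
```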

Complex modes

Our current syntax can't express things defined as a sequence of other things, like this:

function ::= <title> '=' <params> '->' <body>

The problem here is that you don't know that you're in a function definition until you get to the -> symbol. Our parser can start a new parsing mode based only on a single starting lexeme: an opening quote, a keyword, a number, etc.

To work around that we use a horrible kludge:

Match the whole body of the mode with a single regex.
Start a new mode at the beginning of the matched string.
Return the whole thing back to the parser.
Parse it again, by the rules of the new mode.

Not only is it ugly, it also works only when we're lucky that the whole body can actually be parsed by a regex.

So we need to implement a logic allowing the parser to treat anything that matches just the first lexeme as a beginning of the mode, start parsing it and fall back if it doesn't work out.
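
A minimal sketch of that "try it and fall back" logic, assuming a mode is described as a plain sequence of parts. The `FUNCTION_PARTS` description and the `tryParseSequence` function are hypothetical and much simpler than what a real parser would need; the patterns for titles and params are made up too.

```javascript
// Hypothetical description of "function ::= <title> '=' <params> '->' <body>"
// as a sequence of parts. Sticky (y) regexes match exactly at lastIndex.
const FUNCTION_PARTS = [
  { className: 'title', re: /[A-Za-z_]\w*/y },
  { className: null, re: /\s*=\s*/y },
  { className: 'params', re: /[A-Za-z_]\w*(?:\s+[A-Za-z_]\w*)*/y },
  { className: null, re: /\s*->/y },
];

// Try to parse the whole sequence starting at `pos`.
// Returns the matched spans, or null so the caller can fall back
// and treat the text at `pos` as something else.
function tryParseSequence(text, pos, parts) {
  const spans = [];
  let cursor = pos;
  for (const { className, re } of parts) {
    re.lastIndex = cursor;
    const match = re.exec(text);
    if (match === null) return null; // fall back: nothing was committed
    if (className) spans.push({ className, text: match[0], start: cursor });
    cursor += match[0].length;
  }
  return { spans, end: cursor };
}

const parsed = tryParseSequence('add = a b -> a + b', 0, FUNCTION_PARTS);
console.log(parsed.spans.map((s) => s.className + ':' + s.text)); // [ 'title:add', 'params:a b' ]
console.log(tryParseSequence('add = 1 + 2', 0, FUNCTION_PARTS));  // null
```

The point is that nothing is emitted until the whole sequence has matched, so a failed attempt costs only the time spent on it. The body after `->` would be left to the regular parser, starting from `end`.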

Pipe dreams

These I'm posting mostly for fun. There's no plan, not even any certainty that they're actually needed. But, hey, maybe there's something to it, still :-)

A background feedback mechanism for reporting language usage/detection statistics directly from highlighted code on other sites.
Balance keywords relevance based on their usage frequency using machine learning instead of human guessing.
Group all languages into more groups for more convenient download (like "Scripting", "Scientific"… "Toy", "Dead", "Weird" etc.)

Interested?

If you're interested in helping with any of those things please drop a message to our developer discussion group. Thank you!

New life of Marcus

2012-10-22T01:29:01.485000+00:00

A while ago I reported on switching this blog to a custom software named Marcus. Despite its source code being available in the open I didn't intend developing it into a full-blown project for two reasons: a) maintaining it would have taken much more time than I could afford and b) being completely anal about my own blog software I didn't want to piss off contributors by constantly rejecting all the features they would propose. Anyway, if someone felt so compelled they could take the code and start developing it on their own.

Which is exactly what happened. Mikhail Andreev took my old code, put it on GitHub and, as far as I can see, already added quite a bit to it. The project got a different name (by my own request) — django-marcus. It is also available on PyPI.

I'm honored someone deemed my code useful and glad that it won't end up being neglected after all. All hail to Open Source!

HTTP and JSON in highlight.js

2012-05-10T09:02:40.440000+00:00

Fresh from the oven, highlight.js now has a pretty cool feature that, to the best of my knowledge, is not supported by any other syntax highlighter. Namely, we can now recognize and highlight HTTP headers and the message body if it happens to be code in a language we know. This is intended for all sorts of API docs that often present the entire HTTP payload transferring some kind of JSON or XML.

Story

The feature was born out of a conversation with a user asking a very strange question: "How to disable highlighting for only a certain part of the code snippet." I couldn't even imagine why anyone would want to have more than one language in one code snippet until he provided a simple example with an HTTP prologue and a chunk of JSON payload. The actual problem was that highlight.js simply didn't know JSON at that time. But it was also obvious that even if it knew the language of the body, the headers could completely break language detection or pick up random bits of the body's highlighting, like incidentally matching keywords, numbers, etc.

So I answered the user with apologies that we can't help him right now but we might look into this problem at some point in the future. It turned out "some point in the future" came later that evening when I realized that we already have two key ingredients to solve it: highlighting nested languages (used for JavaScript in HTML for example) and language detection. It was just the matter of putting them together.

Outcome

We now have the language "HTTP" that knows how to highlight request lines with a query string inside it, status lines with a numeric code, headers and their values.

We also have a strictly defined "JSON" language that knows pretty much all of JSON. Many thanks to Douglas Crockford for making it so limited and simple to parse. The strict definition makes auto-detection very reliable.
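
To show the shape of the idea rather than the actual implementation, here's a small sketch that splits a message at the blank line and then checks whether the body is JSON. It uses the built-in `JSON.parse` as a stand-in for highlight.js's own language detection, and the function names (`splitHttpMessage`, `looksLikeJson`, `describeHttp`) as well as the sample message are made up.

```javascript
// Split a raw HTTP message into its head and body at the first blank line.
function splitHttpMessage(raw) {
  const match = /\r?\n\r?\n/.exec(raw);
  if (match === null) return { head: raw, body: '' };
  return {
    head: raw.slice(0, match.index),
    body: raw.slice(match.index + match[0].length),
  };
}

// A strict check: the body counts as JSON only if it parses completely.
function looksLikeJson(body) {
  try {
    JSON.parse(body);
    return true;
  } catch (err) {
    return false;
  }
}

function describeHttp(raw) {
  const { head, body } = splitHttpMessage(raw);
  const [startLine, ...headerLines] = head.split(/\r?\n/);
  const headers = headerLines.map((line) => {
    const colon = line.indexOf(':');
    return [line.slice(0, colon).trim(), line.slice(colon + 1).trim()];
  });
  const bodyLanguage = body !== '' && looksLikeJson(body) ? 'json' : null;
  return { startLine, headers, bodyLanguage };
}

const message =
  'POST /task?id=1 HTTP/1.1\nHost: example.org\n' +
  'Content-Type: application/json\n\n{"status": "ok"}';
console.log(describeHttp(message));
```

Because the head is cut off before the body is examined, header names and values can't confuse the detection of the body's language, which is exactly the problem described in the story above.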

Both languages are now in the so-called "common" set which means they will be available in the CDN-hosted version by default in the next release.

Problems

Since no heuristic is completely reliable it would be nice to have some way to specify the sub-language inside a snippet in the same way as is now possible for the whole snippet. The hard part is to invent a way that doesn't suck :-). If you have any ideas, please share!

The other problem is obviously that the code is still very fresh and inevitably contains bugs. So get the source, build it, test it and let us know. Thank you!

Sponsoring in highlight.js

2012-04-11T21:53:29.547000+00:00

I want to draw your attention to an interesting offer made by Adam Kennedy from Kaggle to sponsor the development of syntax highlighting for the R language:

highlight.js came to our attention with the addition of MATLAB support, as it is one of the two dominant languages used by our community. We plan to switch to highlight.js from prettify.js (and already have in a dev branch).

Further, we would like to sponsor the addition to highlight.js of the primary language used by our community, the R statistical computing language ( http://www.r-project.org/ ).

I think this is a nice opportunity to help a good project and make some money along the way. I'm sure Adam will be happy to clarify any details, so reply to the group if you're interested.

From the highlight.js part there is a language definition guide to get you started. Of course I'm always happy to explain how things work in the highlighter in the hopes of getting more contributors on board and sharing maintenance :-).

Also I'm pretty excited about this thing in general. One of my focuses since… well… pretty much since the inception of highlight.js was encouraging other people to contribute to the library. This way we've got such unique languages among syntax highlighters as MEL, RenderMan and Axapta, to name just a few. This sponsoring is a good precedent and if it works out to the mutual satisfaction of the parties I hope it won't be the last.

Rainbow.js — a new kid on the highlighters' block

2012-09-07T08:35:24.504000+00:00

There was a small spike in my referrers stats that led me to a new JavaScript highlighting library — rainbow.js. And since I love bashing other highlighters I couldn't resist this time too :-).

Oh, but be sure that all of this is intended of course as a constructive criticism only!

Size claim

It says upfront that it's 1.2K in size. It isn't. If you include the 5 languages it currently supports, it's 8.1K. Which is still impressive given that highlight.js is 11.8K with the same languages.

Features

It doesn't have any beside highlighting itself. No user markup, no line numbers, no language detection, etc. But as far as I understand, it's a design goal. And I can only wish the author a lot of patience in ~~telling people to shut up~~ carefully evaluating feature requests!

Correctness

This is where things get ugly, unfortunately. I loaded up my test suite and on the spot found these:

prefixed strings in Python (r"", u"") aren't detected
triple-quoted strings in Python are detected wrong (first two quotes are treated as strings)
backslash escapes in strings aren't detected which can break the whole further highlighting in cases like "a \" b"
names of old-style Python classes are not recognized (because of the lack of parens after the names)
doctype declaration in HTML is treated as a tag
tag attributes in HTML aren't detected reliably, like "checked" here: <input checked type="checkbox">
in the CSS snippet {margin: 1cm 2cm 1.3cm 4cm;} "1." and "4cm" are not recognized as values
in div {width: 100%} "100%" is not recognized as a value
in the selector p[lang=en] all "p", "lang" and "en" are detected as tags
in JavaScript literal regexps are not distinguished from division operators which leads to all sorts of breakage
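
The backslash escape item is the easiest one to demonstrate. Here's a minimal sketch comparing a naive string pattern with one that knows about escapes. Both regexes are mine, written for illustration; they aren't taken from rainbow.js or highlight.js.

```javascript
// Naive: a quote, anything that isn't a quote, a quote.
const NAIVE_STRING = /"[^"]*"/g;

// Escape-aware: between the quotes allow either ordinary characters
// or a backslash followed by any single character.
const STRING = /"(?:[^"\\]|\\.)*"/g;

// The source text is: x = "a \" b" + "c"
const code = 'x = "a \\" b" + "c"';

console.log(code.match(NAIVE_STRING)); // [ '"a \\"', '" + "' ]
console.log(code.match(STRING));       // [ '"a \\" b"', '"c"' ]
```

The naive version ends the first string at the escaped quote and then treats the code between two real strings as a string, which is how one missed escape breaks everything after it.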

I'm sure there are many other bugs because…

Speculation

… rainbow.js employs generically defined lexing for all supported languages. Which is good for keeping the library fit and slender but won't work for all the sheer insanity of syntaxes that humanity cared to invent over the last half a century.

There are backslash escapes and double-quote escapes for strings. PHP, Ruby, Shell all allow embedded code within certain types of strings to a certain extent. JavaScript has literal regexps that clash with division. Pascal has different syntax for hex numbers. Lines starting with # are comments in many languages but in C they're preprocessor directives. And don't even get me started on Perl…

All in all I think that the current design of rainbow.js won't allow it to grow past a family of not-too-conflicting language syntaxes. Which puts it in the same position as Google Code Prettify. Which for me means that we don't have to worry about this competition yet. But Google should :-).

Anyway I wish the best of luck to Craig Campbell in his endeavor!

Envy

I totally envy their site design!!!

Completely unfair comparison of Javascript syntax highlighters

2012-03-23T07:00:17.503000+00:00

In the run-up to the latest release of highlight.js 6.0 I decided, for the first time in more than 4 years, to actually look at other highlighting libraries. Sure I knew of their existence before but nonetheless never felt compelled to do any serious comparison because highlight.js is a fun project and I'm quite happy with the result. In fact this comparison has also been made for fun more than for anything else. I just wondered how good (or bad) highlight.js actually looked among similar libraries.

I decided not to take into account highly subjective things like visual appeal (I'm not a good judge here), installation simplicity and documentation clarity (don't know how to measure them). Also I didn't evaluate number of supported languages. While it is a measurable quantity it doesn't mean much for an end user: if a tool doesn't support the language you need you don't care about dozens of others that it does support. Instead I concentrated on universally measurable things that make sense to everyone: size, speed and correctness.

Why "completely unfair" then, you ask? Because I knew who'd win before I even started :-).

Contenders

If you go to the trouble of searching the Internet for "javascript syntax highlighter" you'll inevitably stumble upon hordes of posts all ingeniously similarly titled "N useful/beautiful javascript tools" where N varies from 4 to 20-something. Those were circulating the network for years but, predictably, aren't a very good source of information because they don't actually evaluate usefulness or beauty of solutions they link to.

So I've just picked up those names that I've got used to seeing around in blogs and forums where people try to find such a tool:

SyntaxHighlighter by Alex Gorbachev, used on MDN and others.
SHJS — a library built to be compatible with GNU source-highlight language definitions.
Google Code Prettify — a highlighter used on Google Code and Stack Overflow.
and finally highlight.js originally written by me, used on a popular Russian tech site Habrahabr.ru and others.

I've compiled an enterprisey-looking matrix of features supported by these libraries. It isn't intended for comparison per se because there are different use-cases and sometimes lack of features is a feature too. It's here to give you a general idea on what goal each one can serve.

	highlight.js	SyntaxHighlighter	SHJS	Google Code Prettify
User markup in code snippets	yes	no ¹⁾	yes	yes
Line numbers	no	yes	no	yes
Striped background	no	yes	no	yes
Replacing indenting TABs with spaces	yes	yes	no	no
Language detection	yes	no	no	yes ²⁾
Multi-language code	yes	yes ³⁾	no	yes
Arbitrary HTML container for code	yes	no	no	no
HTML5 compatibility ⁴⁾	yes	no	no	no

Notes:

¹⁾ SyntaxHighlighter doesn't support arbitrary markup but has two special features that cover some use-cases: turning URLs into links and highlighting lines of code that require attention.
²⁾ Prettify doesn't actually do any detection. Instead it employs an interesting approach of generalized highlighting that works independent of language. Though this makes it more prone to errors than the heuristic detection mechanism found in highlight.js.
³⁾ I wasn't able to configure SyntaxHighlighter to do this but I attribute it to my lack of persistence. It works fine on the demo page.
⁴⁾ Surely one couldn't expect being taken seriously these days without shoving trendy "HTML5" moniker somewhere! What it actually means here is that highlight.js automatically recognizes code snippets marked up according to HTML5 recommendation with <pre><code class="language-something"> .. </code></pre>.
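
For the curious, recognizing the `language-something` convention boils down to looking at the class attribute. A minimal sketch, not the library's actual code; `languageFromClass` is a name I made up and it works on a plain string so it can run without a browser.

```javascript
// Find the language name in a class attribute such as "language-python".
// Returns null when no class follows the "language-" convention.
function languageFromClass(classAttr) {
  for (const name of classAttr.split(/\s+/)) {
    const match = /^language-(.+)$/.exec(name);
    if (match) return match[1];
  }
  return null;
}

console.log(languageFromClass('language-python'));    // python
console.log(languageFromClass('snippet language-css')); // css
console.log(languageFromClass('snippet'));            // null
```

In a page one would feed it the `className` of each `<code>` element found inside a `<pre>`.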

Test case

The test page consists of code snippets using 7 popular languages: Python, Ruby, PHP, XML, HTML, CSS and Javascript. The "completely unfair" part of the article shows up here full-scale since those snippets come from highlight.js' own test suite! Anyway I think it was a good idea to use them because they were designed to be short and to exercise as many features of a language as possible. Here are four versions of the test case using highlight.js, SyntaxHighlighter, SHJS and Google Code Prettify in all their styled-by-default glory.

Size

All libraries have their own way to include only the required language definitions on the page: simple linking to language files, on-demand loading, packing into a single file. Also all of them provide minified/packed production versions of files. Gzip compression wasn't used, for no particular reason. The following table shows the overall size of all Javascript needed to highlight the test snippets.

	highlight.js	SyntaxHighlighter	SHJS	Google Code Prettify
Size (KB)	16.4	34.6	16.8	19.2

I didn't include CSS in the calculation because it's not actually required: a site can define highlighting style within its main stylesheet.

Speed

To be honest, modern browsers have made this test irrelevant. All highlighters are fast to the point where highlighting is applied instantly. The only exception was SHJS, which was configured to load language files on demand; in a couple of test runs this left raw un-highlighted code visible for a split second. This says nothing bad about the speed of SHJS itself but rather shows that on-demand loading was a bad idea for the task.

I've measured the speed of highlighting using Firebug. It wasn't as straightforward as counting size because there are more things to take into account here. After some tinkering I've decided on the following method:

To represent the most common real-world case all files are loaded from cache but the browser still performs DNS lookups and establishes TCP connections for each file.
Total load time is defined by DOMContentLoaded event for highlight.js and by onload event for the rest. This may seem unfair but I just did what libraries suggest in their docs.
The time of highlighting itself is measured with Firebug's profiler. Since profiling affects performance this time cannot be simply added to the load time and should be considered separately.
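The numbers in this section come from Firebug's profiler. For readers without it, a manual wall-clock timer is a rough substitute; the sketch below shows the idea. The names timeIt and highlightAll are hypothetical, and highlightAll is a dummy workload standing in for a real library call so that the example runs anywhere.

```javascript
// Rough timing sketch: run a function several times and report
// the fastest, median and slowest run in milliseconds.
function timeIt(fn, runs) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    fn();
    samples.push(Date.now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    min: samples[0],
    median: samples[Math.floor(samples.length / 2)],
    max: samples[samples.length - 1],
  };
}

// Dummy workload (an assumption for the demo, not real highlighting):
// it just builds a long string of <span> elements.
function highlightAll() {
  let html = '';
  for (let i = 0; i < 10000; i++) {
    html += '<span>' + i + '</span>';
  }
  return html.length;
}

const stats = timeIt(highlightAll, 5);
console.log(stats);
```

Taking several runs and looking at the median smooths out one-off pauses, though it still measures far less precisely than a profiler does.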

	highlight.js	SyntaxHighlighter	SHJS	Google Code Prettify
Load time (msecs)	870	1394	1008	1007
Highlighting time (msecs)	55	67	54	72

Richness and correctness

Here is where things get interesting. Size and speed turned out not to affect user experience significantly but the difference in richness and correctness is plainly visible. There won't be any numbers though, just some notes.

I should note that the notion of "correctness" differs from library to library. While there are plain bugs there are also missing features that could be left out deliberately. Here I tried to adhere to my personal views on the subject and you may well be in disagreement with me. That's fine!

SyntaxHighlighter doesn't produce very rich highlighting to begin with. No Python decorators, no Javascript regexps, no CSS @-rules etc… Also it seems to be downright unable to highlight things that require more sophisticated parsing than a regular grammar, like names in function and class definitions. This is not bad by itself. The result still looks useful and leaves fewer places to screw up :-). But there are some issues with correctness anyway:

no multi-line strings in PHP
value-less attributes in HTML tags aren't recognized
within CSS @-rules seemingly random words are recognized as "values" (whatever it could mean)

Not much is highlighted in Javascript.

SHJS looked promising since it uses language definitions from the GNU source-highlight project and I thought those guys would do their job rather meticulously. But in practice it mishandled highlighting more than any of the others:

names of old-style classes in Python aren't highlighted (those of new-style classes are)
class inheritance in Ruby badly breaks the whole line
#{} constructs in Ruby strings aren't recognized
PHP throw keyword is not highlighted
tags are highlighted inside CDATA-escaped sections in XML
unquoted attribute values in HTML tags aren't recognized
@-rules in CSS break the whole highlighting flow
"$" isn't considered part of identifiers in Javascript

Class inheritance (A < B) in Ruby breaks the whole line.

Google Code Prettify works very well both in terms of richness and correctness. It can highlight CSS and Javascript within HTML, and recognizes Python decorators and Javascript regexps. Speaking of the latter, it was from Prettify that I borrowed ideas on how to implement those in highlight.js.

I've found very few issues with it:

tags highlighted inside CDATA-escaped sections in XML
@font-face in CSS is not recognized as @-rule
Ruby highlighting is also simplistic but doesn't cause such severe problems as in SHJS

That <not> inside CDATA shouldn't be highlighted as tag.

As for highlight.js, it's pushed down to the end of the comparison for a reason :-). Obviously there won't be any correctness issues since I used code snippets from its own test suite, which it successfully passes. Of course that doesn't in any way mean it's bug-free. But where the library really stands out is highlighting richness. It just knows much more about languages than the others. Here are the features visible in this very test case that are unique to highlight.js:

raw Python strings
Ruby inheritance, #{} things, quoted symbols, symbolic function names etc.
yardoc in Ruby comments
phpdoc in PHP comments
classes, ids, tags and attributes in CSS selectors

Some of the recognized features (like variables in PHP) are deliberately not styled to maintain visual sanity. Most of these features (and those in other languages) are the result of a painstaking effort by many highlight.js contributors to define the most intricate parsing rules (just look at the Perl definition, for example).

HTML with embedded Javascript and CSS. All sorts of ways to define tag attributes are supported.

Completely balanced conclusion

If you need a solid syntax highlighter (and don't care about line numbers or striped backgrounds) use highlight.js. It is small, fast, rich and correct!

And if you don't like something about it — contribute!

highlight.js 6.0 beta

2011-04-26T09:31:29.684000+00:00

In a fit of fighting procrastination I took on a task I had been putting off for a long time: refactoring the language definitions in highlight.js into a new syntax. It went so well that I also knocked out the other small tasks I had planned for version 6.0. So without further ado, here is the beta of the new major version, and I'm asking you to test it.

Links

Available for testing:

The project on GitHub. Sources, tools, tests.
The full packed library. 90 KB, all languages.
A packed version with 12 popular languages. 26 KB, includes HTML/XML, Javascript, CSS, PHP, Ruby, Perl, Python, C++, C#, Java, SQL, Bash.
An archive of styles. Packed separately for convenience.

Install it on your sites, catch bugs, write to the mailing list or the bug tracker.

Syntax

The main news in this version concerns not the users of the library but its developers. The language definition syntax has become structurally simpler, the defaults have become more logical, and some attributes that used to be needed for handling edge cases are gone. Here is a simplified example for illustration.

Before:

defaultMode: {
  contains: ['string'],
  modes: [
    {
      className: 'string',
      begin: '"', end: '"',
      contains: ['escape']
    },
    {
      className: 'escape', noMarkup: true,
      begin: '\\\\.', end: hljs.IMMEDIATE_RE
    }
  ]
}

After:

defaultMode: {
  contains: [
    {
      className: 'string',
      begin: '"', end: '"',
      contains: [{begin: '\\\\.'}]
    }
  ]
}

Here is what changed:

the mode definitions in modes and their nesting in contains have merged into a single structure
hljs.IMMEDIATE_RE has become the default value for regexps
instead of specifying className together with noMarkup you can now simply omit className
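The three changes can be sketched as a small converter that turns the old example into the new one. This is only an illustration, not the script used for the actual conversion: convertDefaultMode is a hypothetical name, the IMMEDIATE_RE constant is a stand-in with an assumed value so the code runs without the library, and only the attributes from the example are handled.

```javascript
// Stand-in for hljs.IMMEDIATE_RE (assumed value) so the sketch runs
// without the library loaded.
const IMMEDIATE_RE = '\\b|\\B';

// Hypothetical converter from the old definition shape to the new one.
// It does not handle modes that contain themselves (that would recurse forever).
function convertDefaultMode(oldMode) {
  // Old syntax: modes are declared in a flat list and referenced by className.
  const byName = new Map(oldMode.modes.map((m) => [m.className, m]));

  function inline(name) {
    const mode = byName.get(name);
    if (!mode) throw new Error('Unknown mode: ' + name);
    const result = {};
    // className + noMarkup becomes simply "no className".
    if (mode.className && !mode.noMarkup) result.className = mode.className;
    if (mode.begin !== undefined) result.begin = mode.begin;
    // IMMEDIATE_RE is now the default, so it can be left out.
    if (mode.end !== undefined && mode.end !== IMMEDIATE_RE) result.end = mode.end;
    // Names in contains are replaced by the mode objects themselves.
    if (mode.contains && mode.contains.length) {
      result.contains = mode.contains.map(inline);
    }
    return result;
  }

  return { contains: oldMode.contains.map(inline) };
}

const oldDefaultMode = {
  contains: ['string'],
  modes: [
    { className: 'string', begin: '"', end: '"', contains: ['escape'] },
    { className: 'escape', noMarkup: true, begin: '\\\\.', end: IMMEDIATE_RE },
  ],
};

console.log(JSON.stringify(convertDefaultMode(oldDefaultMode), null, 2));
```

Run on the old example, it produces the same nested structure as the new one, with the escape mode reduced to a single begin attribute.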

For the most part the code has become prettier and more readable, though not without flaws: right now the Ruby definition has ten variables for strings that trail along through the whole file :-).

Converting all the languages to the new syntax was the longest and most tedious task, and that is exactly why I decided to publish the new version as a beta first: I don't believe that nothing got broken, even though the internal tests pass. Taking the opportunity, I want to say a special thank you to Valerii Hiora for converting his Objective C definition!

Tools

More precisely, it's now a "tool". The two scripts that packed and assembled languages into the final build have become one, which is more convenient to use, including for debugging.

Languages

This version has 4 new languages:

Haskell by Jeremy Hull
Erlang in two flavors, module and REPL, jointly authored by Nikolay Zakharov, Dmitry Kovega and Sergey Ignatov
Objective C by Valerii Hiora
Vala by Antono Vasiljev

The total number of languages has thus reached 40!

Besides that, two old languages, HTML and CSS, have changed radically. I decided that two separate definitions for HTML and XML make no sense and merged them into one. While at it, I threw out the long lists of keywords from HTML and CSS, because the syntax of both languages is designed to be extensible and doesn't depend on specific keywords. Now tag and attribute names are always highlighted, even non-standard ones.

The nicest part is that dropping the keywords, together with the move to the new syntax, made the new version of the library smaller, even counting the four completely new languages!

Infrastructure

The move to GitHub has fully paid off: new contributors have appeared! And as code hosting it is so good that it even sweetens my move to git, a VCS that is new to me.

The discussion group is a more complicated story. For the most part it is quiet, and the discussions that did happen could just as well have happened in private email. Come to think of it, that's not surprising: the highlighter's core has had a single author from the very beginning, it has survived several rewrites, and by now probably no one but me knows this code well. Still, I don't think the group should be abandoned: it costs nothing to keep, and it's better to have it and not need it than to suddenly need it and not have it.

What's next

The plan is simple and obvious: I want to wait a week or two for bug reports, fix them (or better yet, just merge patches from the reporters themselves) and release the final version.

Also, as I mentioned in passing on Twitter, I would really like to get styles based on the Solarized palette. I'm unlikely to take this on myself, so I'm simply relaying the request here once more. If you like the highlighter and you love attention to detail, your contribution will be very valuable to the community!

highlight.js opens up

2011-01-03T08:21:43.808000+00:00

Although the code of highlight.js has always been open, the library has never been a project in the full sense of the word. There was no common place for developers to talk, no wiki with documentation, and no bug tracking. Instead I simply accepted new languages and patches by email and answered questions. And I often did it very slowly. Despite this, the highlighter managed to become the biggest of my projects, counting by the number of contributors!

So I have finally decided to stop getting in the way of its development and turned it into a proper project.

The main things:

Developer documentation in a public wiki
Code on GitHub
A Google group for development discussions

Although I'm not too fond of git compared to bzr, I did move the code to GitHub, simply yielding to public opinion. It implicitly follows that my long-term goal is to stop writing code in this project and hand that task over to an interested community of developers. I'll sit there like a king and only merge patches :-).

The wiki is currently open to everyone, and I'm already suffering from periodic spam. If I can't fight it effectively, I'll apparently have to introduce some kind of registration.

The last unsolved problem is where to do bug tracking. I've started a discussion about it in the group. The group's language is English.

Join in!

Hosting for highlight.js

2010-09-27T23:05:37.959000+00:00

highlight.js is now hosted at Yandex, so you don't have to download it and can simply link to it directly from yandex.st. This archive doesn't contain all the languages, though, because it would have been indecently large. So I picked the languages that were downloaded most often and took as many of them as would keep the final archive under 30K. The finalists are: HTML/XML, Javascript, CSS, PHP, Ruby, Perl, Python, C++, C#, Java, SQL, Bash (yes, Bash!).
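The selection rule described here (most downloaded first, stop when the size limit is reached) can be sketched as follows. Everything in the example is made up for illustration: pickLanguages is a hypothetical name, and the language names, download counts and sizes are invented rather than real statistics.

```javascript
// All data below is invented for illustration; these are not real
// download statistics or real file sizes.
const candidates = [
  { name: 'lang-a', downloads: 100, sizeKb: 10 },
  { name: 'lang-c', downloads: 60, sizeKb: 9 },
  { name: 'lang-b', downloads: 80, sizeKb: 12 },
  { name: 'lang-d', downloads: 40, sizeKb: 5 },
];

// Take languages in order of popularity until the next one would
// push the archive over the limit.
function pickLanguages(langs, limitKb) {
  const byPopularity = [...langs].sort((a, b) => b.downloads - a.downloads);
  const picked = [];
  let totalKb = 0;
  for (const lang of byPopularity) {
    if (totalKb + lang.sizeKb > limitKb) break;
    picked.push(lang.name);
    totalKb += lang.sizeKb;
  }
  return { picked, totalKb };
}

console.log(pickLanguages(candidates, 30));
// { picked: [ 'lang-a', 'lang-b' ], totalKb: 22 }
```

Note that the loop stops at the first language that doesn't fit instead of skipping it and trying smaller ones, so the result is always the top of the popularity list.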

The style themes are hosted there too, and you can link to them directly as well. How to do it is described in the instructions, so I won't repeat it here.

I hope this will help the highlighter spread on blog hosting services like blogspot.com, where people perpetually struggle with where to put a file. And in general I recommend that everyone for whom the included languages are enough switch to the hosted version, to make better use of your users' browser cache.

P.S. Someone on Twitter asked me why not Google. It's very simple: I wasn't sure they would want to host me and, I must admit, I didn't immediately find on the page whom to ask about what. And talking to colleagues at my own company was, of course, easier :-). Thank you!