highlight.js: what's next

This is a loosely ordered dump of ideas about the future of highlight.js presented for purposes of information and discussion. The project is already big enough that the best I can do for it is not writing code but trying to get people interested in joining in. Let's see if I can show you that this project is not just about herding a bunch of regexes :-)

Testing

Our current "test suite" is no longer anywhere near adequate. It started life as a demo page that, along the way, accidentally took on some rudimentary testing responsibilities. A good demo is short, neat and beautiful, while a good test should be comprehensive. Our suite is unfortunately neither: it's big and ugly, and at the same time it doesn't actually evaluate any tests, relying instead on a human to notice that something is wrong with one of those few dozen languages.

So before we go any further we need a test suite that would:

- test language detection on small and non-obvious fragments
- compare produced markup against control samples with all supported language features
- perform special tests for different settings and features of the library
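As a rough illustration of the second requirement, here is a minimal sketch of a harness that compares produced markup against control samples. Everything in it is an assumption made for the example: `runMarkupTests` and `toyHighlight` are hypothetical names, and the toy highlighter only stands in for the real library so that the snippet runs on its own.

```javascript
// Hypothetical harness: `highlight` is any function (code, language) -> HTML string.
// Nothing here is part of the real highlight.js API.
function runMarkupTests(highlight, cases) {
  const failures = [];
  for (const { name, language, code, expected } of cases) {
    const actual = highlight(code, language);
    if (actual !== expected) {
      failures.push({ name, expected, actual });
    }
  }
  return failures;
}

// Toy stand-in for the real highlighter: wraps the words "if" and "for"
// in a keyword span and ignores the `language` argument.
function toyHighlight(code, language) {
  return code.replace(/\b(if|for)\b/g, '<span class="keyword">$1</span>');
}

const cases = [
  { name: 'single keyword', language: 'toy', code: 'if x',
    expected: '<span class="keyword">if</span> x' },
  { name: 'no keywords', language: 'toy', code: 'x + y',
    expected: 'x + y' },
  // This control sample is wrong on purpose, to show that a mismatch is reported.
  { name: 'deliberately wrong control', language: 'toy', code: 'for',
    expected: 'for' },
];

const failures = runMarkupTests(toyHighlight, cases);
for (const f of failures) {
  console.log(`FAIL ${f.name}: expected ${f.expected} but got ${f.actual}`);
}
```

The point of the design is that the machine, not a human looking at a demo page, decides whether the output is still correct; control samples would live in files next to each language definition.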

I'd say it's a nice big project in its own right!

Class name unification

One of the early design principles for highlight.js was having just a few common class names in order to have universal styles that would work for any language. Unfortunately this principle wasn't strongly enforced. We now have language-specific classes, language-specific style rules and whole language-specific styles. This is a maintenance nightmare, as the number of unique conditions that should be visually tested is the product of the number of unique language features and the number of styles. And since this is an impossible amount of work, we usually test only a small subset of languages and styles and rely on pure luck for the rest, which is the majority.

One way to deal with that is to confine class names to a very generic fixed set and force languages to use only that. Apart from reducing the mess, it will also enable an interesting side feature: automatically generated styles. If the semantics of class names are fixed, we could intelligently group them and assign to those groups a few distinct font/color combinations provided by a user.
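To make the idea of generated styles a bit more concrete, here is a minimal sketch. The group names, the class names in each group and the `generateStyle` function are all hypothetical; the actual fixed set of class names is not decided in this document.

```javascript
// Hypothetical fixed class set, grouped by meaning. Names are illustrative only.
const CLASS_GROUPS = {
  emphasis: ['keyword', 'built_in'],
  literal: ['string', 'number'],
  muted: ['comment'],
};

// `palette` maps a group name to CSS declarations supplied by the user.
// Returns the generated stylesheet as a string.
function generateStyle(groups, palette) {
  const rules = [];
  for (const [group, classNames] of Object.entries(groups)) {
    const declarations = palette[group];
    if (!declarations) continue; // groups without a palette entry get no rule
    const selector = classNames.map((name) => `.${name}`).join(', ');
    const body = Object.entries(declarations)
      .map(([prop, value]) => `${prop}: ${value};`)
      .join(' ');
    rules.push(`${selector} { ${body} }`);
  }
  return rules.join('\n');
}

const css = generateStyle(CLASS_GROUPS, {
  emphasis: { color: '#00f', 'font-weight': 'bold' },
  literal: { color: '#080' },
});
console.log(css);
// .keyword, .built_in { color: #00f; font-weight: bold; }
// .string, .number { color: #080; }
```

A user would only pick a handful of font/color combinations, one per group, and every language would pick them up automatically because all languages share the same class names.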

And this is certainly going to be completely backwards incompatible.

De-specializing keywords

Currently we parse keywords differently from the rest of the language features. They use their own completely independent parsing pass. This gives us speed (which was the main reason for introducing it) and a neat definition syntax, at the price of code size and complexity. The speed advantage is largely irrelevant by now, as browsers have become much faster than they were six years ago. So it is a good time to throw that special code away.

The syntax will change, so instead of this:

{
  keywords: 'if for while ... ',
  contains: [
    STRINGS,
    NUMBERS
  ]
}

We're going to have something like this:

{
  contains: [
    {
      className: 'keyword',
      beginWords: 'if for while ... ',
    },
    STRINGS,
    NUMBERS
  ]
}

This is by no means final; it just shows the idea: I want keywords to become a regular parsing mode.
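One way such a `beginWords` list could be lowered into an ordinary mode is to compile it into a single regex, so that the parser needs no separate keyword pass. The sketch below is only an assumption about how that might look; `compileBeginWords` and `escapeRegex` are hypothetical helper names, not part of the library.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Turn a space-separated word list into one regex that matches any of the
// words as a whole word. Hypothetical helper, shown for illustration only.
function compileBeginWords(beginWords) {
  const words = beginWords.split(/\s+/).filter(Boolean).map(escapeRegex);
  return new RegExp(`\\b(?:${words.join('|')})\\b`);
}

// The keyword mode then looks like any other regex-started mode.
const keywordMode = {
  className: 'keyword',
  begin: compileBeginWords('if for while'),
};

console.log(keywordMode.begin.source);            // \b(?:if|for|while)\b
console.log(keywordMode.begin.test('while (x)')); // true
console.log(keywordMode.begin.test('format(x)')); // false
```

Note that `\b` only works for keywords made of word characters; languages whose keywords contain hyphens or other punctuation would need a different boundary rule.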

Complex modes

Our current syntax can't express things defined as a sequence of other things, like this:

function ::= <title> '=' <params> '->' <body>

The problem here is that you don't know that you're in a function definition until you get to the -> symbol. Our parser can start a new parsing mode based only on a single starting lexeme: an opening quote, a keyword, a number, etc.

To work around that we use a horrible kludge:

1. Match the whole body of the mode with a single regex.
2. Start a new mode at the beginning of the matched string.
3. Return the whole thing back to the parser.
4. Parse it again, by the rules of the new mode.

Not only is it ugly, it also works only when we're lucky enough that the whole body can actually be matched by a regex.
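The shape of this workaround can be sketched on a toy grammar. This is not the library's code: the one-line `title = params -> body` syntax, the `WHOLE_FUNCTION` regex and both function names are assumptions made up for the example.

```javascript
// Toy grammar, assumed for illustration: title = params -> body, all on one line.
const WHOLE_FUNCTION = /^\w+\s*=[^=]*?->.*$/;

// Second pass: runs only on text the whole-body regex has already accepted,
// and scans that same text again to find the parts.
function parseFunctionMode(text) {
  const eq = text.indexOf('=');
  const arrow = text.indexOf('->', eq);
  return {
    mode: 'function',
    title: text.slice(0, eq).trim(),
    params: text.slice(eq + 1, arrow).trim(),
    body: text.slice(arrow + 2).trim(),
  };
}

function parseLine(line) {
  // First pass: one regex has to recognize the entire construct up front.
  if (!WHOLE_FUNCTION.test(line)) {
    return { mode: 'plain', text: line };
  }
  // The matched text is handed back and parsed a second time.
  return parseFunctionMode(line);
}

console.log(parseLine('add = a, b -> a + b'));
console.log(parseLine('total = a + b'));
```

Every function definition is scanned twice here, and the approach stops working as soon as the body is something a single regex cannot describe, such as nested or multi-line constructs.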

So we need to implement logic that allows the parser to treat anything that matches just the first lexeme as the beginning of a mode, start parsing it, and fall back if it doesn't work out.
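A minimal sketch of that kind of speculative parsing, again on an assumed toy grammar (`title = params ->`), could look like the following. The tokenizer, the function names and the output format are all hypothetical; the sketch only shows the try-then-fall-back control flow, not how it would fit into the real parser.

```javascript
// Split source into identifiers, '->', and single non-space characters.
function tokenize(source) {
  return source.match(/\w+|->|\S/g) || [];
}

const isWord = (token) => /^\w+$/.test(token || '');

// Try to read  title '=' params '->'  starting at `pos`.
// Returns the position just after '->' on success, or -1 so the caller can fall back.
function tryFunctionHead(tokens, pos) {
  let i = pos;
  if (!isWord(tokens[i])) return -1; // title
  i++;
  if (tokens[i] !== '=') return -1;
  i++;
  while (i < tokens.length && (isWord(tokens[i]) || tokens[i] === ',')) i++; // params
  if (tokens[i] !== '->') return -1;
  return i + 1;
}

function parse(source) {
  const tokens = tokenize(source);
  const out = [];
  let pos = 0;
  while (pos < tokens.length) {
    const end = tryFunctionHead(tokens, pos);
    if (end !== -1) {
      // Speculation succeeded: commit everything consumed as a function head.
      out.push({ mode: 'function', tokens: tokens.slice(pos, end) });
      pos = end;
    } else {
      // Speculation failed: stay at `pos` and emit one token as plain text.
      out.push({ mode: 'plain', tokens: [tokens[pos]] });
      pos++;
    }
  }
  return out;
}

console.log(parse('add = a, b -> a + b'));
console.log(parse('total = a + b'));
```

The mode is entered on the first lexeme alone (an identifier), and the parser only commits once `->` has been seen; `total = a + b` starts the same way but falls back to plain tokens.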

Pipe dreams

These I'm posting mostly for fun. There's no plan, not even any certainty that they're actually needed. But, hey, maybe there's something to them :-)

- A background feedback mechanism for reporting language usage/detection statistics directly from highlighted code on other sites.
- Balance keyword relevance based on usage frequency, using machine learning instead of human guessing.
- Sort all languages into more groups for more convenient download (like "Scripting", "Scientific"… "Toy", "Dead", "Weird" etc.)

Interested?

If you're interested in helping with any of these things, please drop a message to our developer discussion group. Thank you!