It's better late than never. Last night I finally released the next version of highlight.js containing many nice additions, some of which had been ready and waiting for as long as half a year.

New languages

All in all, highlight.js currently supports no fewer than 32 different languages!

Fixes for existing languages

Existing languages also got their share of bugfixes and improvements. You can find every detail in the Bazaar log, but changes to two particular languages are worth a special note.

Loren Segal gave the Ruby definition a thorough makeover and added highlighting for YARD inline documentation (of his own invention). Also at his suggestion highlight.js got a new language mode attribute, displayClassName, which can be used in generated markup instead of the usual className. The need for this appeared when he'd refactored the definition for function titles into a different mode and had to name it somewhat differently: "ftitle" instead of "title". This broke styling of the highlighted text, and to fix this we had two options: either provide a .ftitle selector in addition to all .title selectors in all stylesheets, or invent a new attribute to let the new mode appear in markup as "title". We did the latter.
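To make the idea concrete, here is a minimal sketch of how a renderer could pick the class name for generated markup. Only the attribute names className and displayClassName come from the text above; the mode objects and the helper functions markupClass and wrap are hypothetical and are not the actual highlight.js internals.

```javascript
// Hypothetical mode objects: only the attribute names are taken from the post.
const titleMode = { className: 'title' };
const functionTitleMode = { className: 'ftitle', displayClassName: 'title' };

// Prefer displayClassName when a mode defines it, otherwise fall back to className.
function markupClass(mode) {
  return mode.displayClassName || mode.className;
}

// Wrap already-escaped text in a span carrying the chosen class.
function wrap(mode, text) {
  return '<span class="' + markupClass(mode) + '">' + text + '</span>';
}

console.log(wrap(titleMode, 'render'));         // <span class="title">render</span>
console.log(wrap(functionTitleMode, 'render')); // <span class="title">render</span>
```

Both modes end up as "title" in the markup, so existing .title selectors keep working without touching any stylesheet.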

Another change involved the definition of SQL, and it's what kept me working late last night. SQL was always too "greedy" when it came to automatic language detection. Here's, for example, a snippet of code that was detected as SQL:

from django.utils import translation

A short introduction into the highlighter architecture is due here.

The highlighter doesn't need to interpret a program written in a language; it only needs to highlight it. Thus a highlighting definition doesn't describe the whole syntax of a language. The highlighter thinks of a language as a big pile of plain text with some chunks of special structures like strings and comments, which are called "modes". A mode defines, among other things, a set of keywords that it can contain. Most languages have all their keywords defined in a basic default mode that represents that pile of plain text in between strings, comments and other special things.

SQL is a good example of such a language. The problem with it is that it has a lot of keywords: 217 (compare with Python, which has only 37). This means that you can fairly easily find SQL keywords in other programs that use them as variable names. The SQL keywords in the snippet above are "from", "translation" and "language". And if you look at it as Python there are only "from" and "import". So 4 keyword instances versus 2 easily make a win for SQL here.
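The counting effect can be reproduced with a rough sketch. Everything here is an assumption made for illustration: the keyword sets are tiny hypothetical subsets, the two-line sample is made up (it is not the original snippet), and the scoring is plain keyword counting rather than the highlighter's real detection code.

```javascript
// Hypothetical, heavily abbreviated keyword sets; the real definitions are much larger.
const KEYWORDS = {
  python: new Set(['from', 'import', 'def', 'class', 'return']),
  sql: new Set(['from', 'select', 'where', 'translation', 'language']),
};

// Count how many words in the code are keywords of the given language.
function countKeywords(code, language) {
  const words = code.toLowerCase().match(/[a-z_]+/g) || [];
  return words.filter((word) => KEYWORDS[language].has(word)).length;
}

// Pick the language with the highest keyword count.
function detectByKeywords(code) {
  let best = null;
  let bestCount = -1;
  for (const language of Object.keys(KEYWORDS)) {
    const count = countKeywords(code, language);
    if (count > bestCount) {
      best = language;
      bestCount = count;
    }
  }
  return best;
}

// Made-up Python sample that happens to use SQL keywords as ordinary names.
const sample = 'from django.utils import translation\ntranslation.activate(language)';
console.log(countKeywords(sample, 'python')); // 2
console.log(countKeywords(sample, 'sql'));    // 4
console.log(detectByKeywords(sample));        // sql
```

The larger a language's keyword list, the more likely ordinary identifiers in other languages are to score for it.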

To fight this issue I have one pretty reliable method: turn a language definition from "a pile of plain text with chunks of special modes" into one where "the language consists only of strictly defined modes". The trick here is to figure out how to define it without actually implementing a full language parser. After some pondering over the issue I realized that SQL consists largely of just two things: comments and SQL statements, and those statements, in turn, contain all the keywords. An SQL statement is defined as a thing beginning with one of a few reserved words ("select", "insert", "alter" etc., 22 in total) and ending with a semicolon or end of file.

Applying this to our snippet, one can see that it doesn't contain a single SQL statement because it doesn't contain any word that would start one. So starting from the current version the highlighter should detect SQL only when the code really is SQL. Modulo bugs :-)
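The statement rule can be approximated with a single regular expression. This is only a sketch of the idea, not the actual SQL definition: the list of starting words is a hypothetical subset (the post names only "select", "insert" and "alter" out of 22), and comments and string literals are ignored.

```javascript
// Hypothetical subset of the reserved words that may start a statement.
const STATEMENT_STARTERS = ['select', 'insert', 'update', 'delete', 'alter', 'create', 'drop'];

// A statement begins with a starter word and runs to the next semicolon or the end of input.
const STATEMENT = new RegExp(
  '\\b(?:' + STATEMENT_STARTERS.join('|') + ')\\b[^;]*(?:;|$)',
  'gi'
);

// Return every chunk of the code that looks like an SQL statement.
function findSqlStatements(code) {
  return code.match(STATEMENT) || [];
}

console.log(findSqlStatements('from django.utils import translation')); // []
console.log(findSqlStatements('SELECT name FROM users; -- done'));      // [ 'SELECT name FROM users;' ]
```

Keywords such as "from" now count only inside a matched statement, so the Python import line no longer scores as SQL at all.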

Working as a library

I refactored the initialization code of the highlighter to make it more usable as a library as opposed to a stand-alone script. Before that it had to be initialized with a call to initHighlightingOnLoad, which would hook into the page load event and then find code blocks and highlight them. This approach had some drawbacks:

Now all this can be done manually. Here's an example of initializing the highlighting using jQuery's page load event:

$(document).ready(function() {
  $('div.pre').each(function(i, e) {hljs.highlightBlock(e, '    ')});
});

The function highlightBlock accepts a DOM element containing the code text as its first argument and, optionally, a replacement for tab characters as its second argument. The same function can be used to highlight any block of code at any moment in the page's lifetime.
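For highlighting content added after the page has loaded, a small helper along these lines might do. The helper name highlightAll and its parameters are hypothetical; the only thing it assumes about the highlighter is the highlightBlock(element, tabReplacement) call described above.

```javascript
// Highlight every element in a collection (an array or a NodeList).
// `highlighter` is expected to expose highlightBlock(element, tabReplacement).
function highlightAll(highlighter, elements, tabReplacement) {
  Array.prototype.forEach.call(elements, function (element) {
    highlighter.highlightBlock(element, tabReplacement);
  });
}

// In a page, after inserting new content into some `container` element:
// highlightAll(hljs, container.querySelectorAll('div.pre'), '    ');
```

Passing the highlighter in as an argument keeps the helper independent of the global hljs object and of jQuery.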

WordPress plugin

In this version I've dropped support for the WordPress plugin. Since my blog now runs on custom software I can't debug and test the plugin anymore. It would be nice if someone picked up its maintenance. The code is still available in the Bazaar repository and should actually work as is.

However, someone wrote to me some time ago saying that the plugin might have a security vulnerability because it doesn't check for nonces. Unfortunately I couldn't bring myself to figure out what it actually is, whether the hole is real and how to fix it.

Comments: 2

  1. Nick Lutsiuk

    Your English is fine, you don't have to put too much effort to improve on that. But here's my bit of nitpicking:

    Last night I've finally released

    Last night I finally released

    when he's refactored

    when he'd refactored

    let the new mode to appear

    let the new mode appear OR allow the new mode to appear

    all languages contains only of

    all the language consists only of OR all the language contains only

    Anyway, a couple more blog posts like this and you're going to have to put away that apologetic paragraph at the bottom. : )

  2. Ivan Sagalaev

    Nick Lutsiuk, thanks!

    Fixed everything. Though the last one was not a genuine mistake but just an artifact left from reformulation.
