Programming language code consists of parts with different parsing rules: keywords like for or if don't make sense inside strings, strings may contain backslash-escaped symbols like \", and comments usually don't contain anything interesting except the end of the comment.
In highlight.js such parts are called modes.
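Each mode consists of a starting condition, an ending condition, a list of contained sub-modes, and lexing rules with a list of keywords.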
The parser's job is to look for modes and their keywords. Upon finding them, it wraps them in the markup <span class="…">…</span> and uses the name of the mode (“string”, “comment”, “number”) or a keyword group name (“keyword”, “literal”, “built-in”) as the span's class name.
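The wrapping step can be sketched in a few lines. This is only an illustration, not the library's actual code: the helper names escapeHtml and wrap are hypothetical.

```javascript
// Hypothetical helpers, not part of highlight.js.

// Escape characters that would otherwise be read as HTML markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap a matched piece of code in a span whose class is the mode name
// or the keyword group name.
function wrap(className, text) {
  return '<span class="' + className + '">' + escapeHtml(text) + '</span>';
}

var stringMarkup = wrap('string', '"a<b"');
var keywordMarkup = wrap('keyword', 'for');
```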
A language definition is a JavaScript object containing a description of modes and keywords. Its structure is as follows:
some_lang = {
  case_insensitive: true, // global language properties
  defaultMode: {
    // definition of keywords, lexing, containing sub-modes, etc…
    contains: [
      // sub-modes of the default mode
    ]
  }
}
The default mode is the one in which the parser starts to process a language. Usually this mode accounts for most of the code and describes all language keywords. A notable exception is HTML, in which the default mode is just user text that doesn't contain any keywords, and most of the interesting parsing happens inside tags.
In the simple case keywords are defined by a JavaScript object whose keys are the keywords themselves and whose values are usually 1 (the value sets the keyword's relevance, but more on this later):
keywords: {'else': 1, 'for': 1, 'if': 1, 'while': 1}
A language can contain different kinds of “keywords” that may not be called as such by the language spec but are very close to them from the point of view of a syntax highlighter. These are all sorts of “literals”, “built-ins”, “symbols” and such. To define such keyword groups the attribute keywords becomes an object of nested named objects:
keywords: {
  'keyword': {'else': 1, 'for': 1, 'if': 1, 'while': 1},
  'literal': {'false': 1, 'true': 1, 'null': 1}
}
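To make the two shapes of the keywords attribute concrete, here is a minimal sketch of how a word could be mapped to its class name. The function keywordClass is hypothetical, and treating the flat form as the “keyword” group is an assumption made for this illustration.

```javascript
// Hypothetical lookup, not part of highlight.js.
function keywordClass(word, keywords) {
  var own = Object.prototype.hasOwnProperty;
  // Flat form: {'if': 1, ...}. Assumed here to mean the 'keyword' group.
  if (own.call(keywords, word) && typeof keywords[word] === 'number') {
    return 'keyword';
  }
  // Grouped form: {'keyword': {...}, 'literal': {...}}.
  for (var group in keywords) {
    if (own.call(keywords, group) &&
        typeof keywords[group] === 'object' &&
        own.call(keywords[group], word)) {
      return group;
    }
  }
  return null; // not a keyword of any kind
}

var flat = {'else': 1, 'for': 1, 'if': 1, 'while': 1};
var grouped = {
  'keyword': {'else': 1, 'for': 1, 'if': 1, 'while': 1},
  'literal': {'false': 1, 'true': 1, 'null': 1}
};
```

With these definitions, keywordClass('if', flat) gives 'keyword' and keywordClass('true', grouped) gives 'literal'.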
To detect keywords highlight.js breaks the processed chunk of code into separate words, a process called lexing. The “word” here is defined by the regexp [a-zA-Z][a-zA-Z0-9_]* that works for keywords in most languages. Different lexing rules can be defined by a lexems attribute:
defaultMode: {
  lexems: '-[a-z]+',
  keywords: {'-import': 1, …}
}
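The lexing step described above can be sketched as follows. This is a simplified stand-in, not the library's implementation: findKeywords is a hypothetical name, and the sketch ignores case-insensitivity and sub-modes.

```javascript
// Hypothetical helper, not part of highlight.js.
var DEFAULT_LEXEMS = '[a-zA-Z][a-zA-Z0-9_]*';

// Break `code` into words using the lexems regexp and return the words
// that appear as keys in `keywords`, in order of appearance.
function findKeywords(code, keywords, lexems) {
  var re = new RegExp(lexems || DEFAULT_LEXEMS, 'g');
  var found = [];
  var match;
  while ((match = re.exec(code)) !== null) {
    if (match[0] === '') {
      re.lastIndex++; // guard against regexps that can match nothing
      continue;
    }
    if (Object.prototype.hasOwnProperty.call(keywords, match[0])) {
      found.push(match[0]);
    }
  }
  return found;
}

// Default word rule.
var plain = findKeywords('if (x) for (;;) foo()', {'for': 1, 'if': 1});
// Custom rule in which a keyword starts with a dash.
var dashed = findKeywords('-import a -other b', {'-import': 1}, '-[a-z]+');
```

Under the default rule the word "-import" would never be seen whole, because "-" is not a word character; that is the kind of case a custom lexems value handles.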
Each mode can contain other modes that are listed in the contains attribute:
defaultMode: {
  lexems: '...',
  keywords: {...},
  contains: [
    hljs.QUOTE_STRING_MODE,
    hljs.C_LINE_COMMENT,
    { ... custom mode definition ... }
  ]
}
A mode can reference itself in the contains array by using a special keyword 'self'. This is used to define nested modes:
{
  className: 'object',
  begin: '{', end: '}',
  contains: [hljs.QUOTE_STRING_MODE, 'self']
}
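One way to picture what 'self' means is a step that replaces the string with a reference to the enclosing mode, which makes the definition recursive. The sketch below is an assumption about how such a step could look, not a description of the library's internals; resolveSelf is a hypothetical name and STRING_MODE is a stand-in for hljs.QUOTE_STRING_MODE.

```javascript
// Stand-in for hljs.QUOTE_STRING_MODE so the sketch runs on its own.
var STRING_MODE = {className: 'string', begin: '"', end: '"'};

// Hypothetical helper: replace every 'self' entry in `contains`
// with a reference to the mode itself.
function resolveSelf(mode) {
  mode.contains = (mode.contains || []).map(function (sub) {
    return sub === 'self' ? mode : sub;
  });
  return mode;
}

var objectMode = resolveSelf({
  className: 'object',
  begin: '{', end: '}',
  contains: [STRING_MODE, 'self']
});
```

After this step objectMode.contains[1] is objectMode itself, so an object can contain another object to any depth.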
Modes usually generate actual highlighting markup — <span> elements with specific class names that are defined by the className attribute:
defaultMode: {
  contains: [
    {
      className: 'string',
      // ... other attributes
    },
    {
      className: 'number',
      // ...
    }
  ]
}
Names are not required to be unique; it's quite common to have several definitions with the same name. For example, many languages have several syntaxes for strings, comments and so on.
Sometimes modes are defined only to support specific parsing rules and aren't needed in the final markup. A classic example is an escape sequence inside strings that allows them to contain the closing quote:
{
  className: 'string',
  begin: '"', end: '"',
  // unnamed sub-mode: a backslash followed by any single character
  contains: [{begin: '\\\\.', end: hljs.IMMEDIATE_RE}]
}
For such modes the className attribute should be omitted so that they don't generate excessive markup.
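To see why the escape sub-mode matters, consider a minimal hand-written scanner for a double-quoted string. It is only a sketch of the idea, written without highlight.js; readString is a hypothetical name.

```javascript
// Hypothetical scanner, not part of highlight.js.
// `start` must be the index of the opening quote in `code`.
// Returns the whole string literal, or null if it is never closed.
function readString(code, start) {
  var i = start + 1;
  while (i < code.length) {
    if (code[i] === '\\') {
      i += 2; // the "escape" rule: skip the backslash and the next character
      continue;
    }
    if (code[i] === '"') {
      return code.slice(start, i + 1); // closing quote found
    }
    i++;
  }
  return null;
}

// The source text here is: "a\"b" rest
var literal = readString('"a\\"b" rest', 0);
```

Without the escape rule the scanner would stop at the escaped quote and cut the string short. The rule itself produces no markup of its own, which is why the corresponding mode has no className.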
Other useful attributes are defined in the mode reference.
highlight.js tries to automatically detect the language of a code fragment. The heuristic is simple: it tries to highlight the fragment with every language definition, and the one that yields the most specific modes and keywords wins. The job of a language definition is to help this heuristic by hinting at the relative relevance (or irrelevance) of modes.
This is best illustrated by example. Python has special kinds of strings defined by prefix letters before the quotes: r"…", u"…". If a code fragment contains such strings there is a good chance that it's in Python. So these string modes are given high relevance:
{
  className: 'string',
  begin: 'r"', end: '"',
  relevance: 10
}
On the other hand, conventional strings in plain single or double quotes aren't specific to any language and it makes sense to bring their relevance to zero to lessen statistical noise:
{
  className: 'string',
  begin: '"', end: '"',
  relevance: 0
}
The default value for relevance is 1. When setting an explicit value it's recommended to use either 10 or 0.
Keywords also influence relevance. A keyword's weight is the value assigned to it in the keywords object. For simple keywords that could just as well be plain identifiers in other languages this value is usually 1:
{'for': 1, 'if': 1, 'while': 1}
However, some keywords are unique inventions that nobody would reasonably use as a variable name. For example, reinterpret_cast is a clear indicator that we're looking at C++. It's worth setting the relevance of such keywords a bit higher.
Note that keyword relevance should not be set to 0 because it cancels recognition of the keyword altogether.
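The scoring idea can be sketched with a deliberately simplified model: count how often each mode's begin pattern occurs, multiply by the mode's relevance, add the weights of the keywords found, and pick the language with the highest total. The real parser tracks nesting and end conditions, so this is an illustration of the heuristic only. The names scoreFragment and detect, the two toy language definitions, and the weight of 10 for reinterpret_cast are assumptions made for the example.

```javascript
// Hypothetical scoring helpers, not part of highlight.js.
var DEFAULT_RELEVANCE = 1;

function scoreFragment(code, lang) {
  var score = 0;
  // Each occurrence of a mode's begin pattern adds the mode's relevance.
  (lang.modes || []).forEach(function (mode) {
    var hits = code.match(new RegExp(mode.begin, 'g')) || [];
    var relevance = mode.relevance === undefined ? DEFAULT_RELEVANCE : mode.relevance;
    score += hits.length * relevance;
  });
  // Each keyword found adds its own weight.
  var keywords = lang.keywords || {};
  var words = code.match(/[a-zA-Z][a-zA-Z0-9_]*/g) || [];
  words.forEach(function (word) {
    if (Object.prototype.hasOwnProperty.call(keywords, word)) {
      score += keywords[word];
    }
  });
  return score;
}

// Return the name of the language with the highest score.
function detect(code, languages) {
  var best = null;
  var bestScore = -1;
  Object.keys(languages).forEach(function (name) {
    var score = scoreFragment(code, languages[name]);
    if (score > bestScore) {
      best = name;
      bestScore = score;
    }
  });
  return best;
}

// Toy definitions for illustration only.
var languages = {
  python: {
    modes: [{begin: 'r"', relevance: 10}, {begin: '"', relevance: 0}],
    keywords: {'for': 1, 'if': 1, 'while': 1}
  },
  cpp: {
    modes: [{begin: '"', relevance: 0}],
    keywords: {'for': 1, 'if': 1, 'while': 1, 'reinterpret_cast': 10}
  }
};

var first = detect('if x: y = r"\\d+"', languages);
var second = detect('if (p) q = reinterpret_cast<int*>(p);', languages);
```

In this toy setup the first fragment is attributed to python because of the r"…" string, and the second to cpp because of reinterpret_cast; the shared keyword if contributes equally to both and decides nothing.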
Another way to improve language detection is to define illegal symbols for a mode. For example, in Python the first line of a class definition (class MyClass(object):) cannot contain the symbol "{" or a newline. The presence of these symbols clearly shows that the language is not Python, and the parser can drop this attempt early.
Illegal symbols are defined as a single regular expression:
{
  className: 'class',
  illegal: '[${]' // a character class: matches a literal "$" or "{"
}
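The early-rejection check amounts to testing the text inside the mode against the illegal regexp. A minimal sketch, assuming a hypothetical helper isIllegal and checking a single line at a time:

```javascript
// Hypothetical helper, not part of highlight.js.
function isIllegal(text, mode) {
  return mode.illegal !== undefined && new RegExp(mode.illegal).test(text);
}

var classMode = {className: 'class', illegal: '[${]'};

var pythonLine = isIllegal('class MyClass(object):', classMode);
var otherLine = isIllegal('class MyClass {', classMode);
```

Here pythonLine is false and otherLine is true, so a parser trying the Python definition on the second line could give up at once.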
Many languages share common modes and regular expressions. These are defined near the end of the core highlight.js code, under the “Common regexps” and “Common modes” titles. Use them when possible.
Follow the contributor checklist.