diff options
Diffstat (limited to 'node_modules/highlight.js/docs/language-guide.rst')
-rw-r--r-- | node_modules/highlight.js/docs/language-guide.rst | 264 |
1 files changed, 264 insertions, 0 deletions
diff --git a/node_modules/highlight.js/docs/language-guide.rst b/node_modules/highlight.js/docs/language-guide.rst new file mode 100644 index 000000000..f48c748be --- /dev/null +++ b/node_modules/highlight.js/docs/language-guide.rst @@ -0,0 +1,264 @@ +Language definition guide +========================= + +Highlighting overview +--------------------- + +Programming language code consists of parts with different rules of parsing: keywords like ``for`` or ``if`` +don't make sense inside strings, strings may contain backslash-escaped symbols like ``\"`` +and comments usually don't contain anything interesting except the end of the comment. + +In highlight.js such parts are called "modes". + +Each mode consists of: + +* starting condition +* ending condition +* list of contained sub-modes +* lexing rules and keywords +* …exotic stuff like another language inside a language + +The parser's work is to look for modes and their keywords. +Upon finding, it wraps them into the markup ``<span class="...">...</span>`` +and puts the name of the mode ("string", "comment", "number") +or a keyword group name ("keyword", "literal", "built-in") as the span's class name. + + +General syntax +-------------- + +A language definition is a JavaScript object describing the default parsing mode for the language. +This default mode contains sub-modes which in turn contain other sub-modes, effectively making the language definition a tree of modes. + +Here's an example: + +:: + + { + case_insensitive: true, // language is case-insensitive + keywords: 'for if while', + contains: [ + { + className: 'string', + begin: '"', end: '"' + }, + hljs.COMMENT( + '/\\*', // begin + '\\*/', // end + { + contains: [ + { + className: 'doc', begin: '@\\w+' + } + ] + } + ) + ] + } + +Usually the default mode accounts for the majority of the code and describes all language keywords. +A notable exception here is XML in which a default mode is just a user text that doesn't contain any keywords, +and most interesting parsing happens inside tags. + + +Keywords +-------- + +In the simple case language keywords are defined in a string, separated by space: + +:: + + { + keywords: 'else for if while' + } + +Some languages have different kinds of "keywords" that might not be called as such by the language spec +but are very close to them from the point of view of a syntax highlighter. These are all sorts of "literals", "built-ins", "symbols" and such. +To define such keyword groups the attribute ``keywords`` becomes an object each property of which defines its own group of keywords: + +:: + + { + keywords: { + keyword: 'else for if while', + literal: 'false true null' + } + } + +The group name becomes then a class name in a generated markup enabling different styling for different kinds of keywords. + +To detect keywords highlight.js breaks the processed chunk of code into separate words — a process called lexing. +The "word" here is defined by the regexp ``[a-zA-Z][a-zA-Z0-9_]*`` that works for keywords in most languages. +Different lexing rules can be defined by the ``lexemes`` attribute: + +:: + + { + lexemes '-[a-z]+', + keywords: '-import -export' + } + + +Sub-modes +--------- + +Sub-modes are listed in the ``contains`` attribute: + +:: + + { + keywords: '...', + contains: [ + hljs.QUOTE_STRING_MODE, + hljs.C_LINE_COMMENT, + { ... custom mode definition ... } + ] + } + +A mode can reference itself in the ``contains`` array by using a special keyword ``'self``'. +This is commonly used to define nested modes: + +:: + + { + className: 'object', + begin: '{', end: '}', + contains: [hljs.QUOTE_STRING_MODE, 'self'] + } + + +Comments +-------- + +To define custom comments it is recommended to use a built-in helper function ``hljs.COMMENT`` instead of describing the mode directly, as it also defines a few default sub-modes that improve language detection and do other nice things. + +Parameters for the function are: + +:: + + hljs.COMMENT( + begin, // begin regex + end, // end regex + extra // optional object with extra attributes to override defaults + // (for example {relevance: 0}) + ) + + +Markup generation +----------------- + +Modes usually generate actual highlighting markup — ``<span>`` elements with specific class names that are defined by the ``className`` attribute: + +:: + + { + contains: [ + { + className: 'string', + // ... other attributes + }, + { + className: 'number', + // ... + } + ] + } + +Names are not required to be unique, it's quite common to have several definitions with the same name. +For example, many languages have various syntaxes for strings, comments, etc… + +Sometimes modes are defined only to support specific parsing rules and aren't needed in the final markup. +A classic example is an escaping sequence inside strings allowing them to contain an ending quote. + +:: + + { + className: 'string', + begin: '"', end: '"', + contains: [{begin: '\\\\.'}], + } + +For such modes ``className`` attribute should be omitted so they won't generate excessive markup. + + +Mode attributes +--------------- + +Other useful attributes are defined in the :doc:`mode reference </reference>`. + + +.. _relevance: + +Relevance +--------- + +Highlight.js tries to automatically detect the language of a code fragment. +The heuristics is essentially simple: it tries to highlight a fragment with all the language definitions +and the one that yields most specific modes and keywords wins. The job of a language definition +is to help this heuristics by hinting relative relevance (or irrelevance) of modes. + +This is best illustrated by example. Python has special kinds of strings defined by prefix letters before the quotes: +``r"..."``, ``u"..."``. If a code fragment contains such strings there is a good chance that it's in Python. +So these string modes are given high relevance: + +:: + + { + className: 'string', + begin: 'r"', end: '"', + relevance: 10 + } + +On the other hand, conventional strings in plain single or double quotes aren't specific to any language +and it makes sense to bring their relevance to zero to lessen statistical noise: + +:: + + { + className: 'string', + begin: '"', end: '"', + relevance: 0 + } + +The default value for relevance is 1. When setting an explicit value it's recommended to use either 10 or 0. + +Keywords also influence relevance. Each of them usually has a relevance of 1, but there are some unique names +that aren't likely to be found outside of their languages, even in the form of variable names. +For example just having ``reinterpret_cast`` somewhere in the code is a good indicator that we're looking at C++. +It's worth to set relevance of such keywords a bit higher. This is done with a pipe: + +:: + + { + keywords: 'for if reinterpret_cast|10' + } + + +Illegal symbols +--------------- + +Another way to improve language detection is to define illegal symbols for a mode. +For example in Python first line of class definition (``class MyClass(object):``) cannot contain symbol "{" or a newline. +Presence of these symbols clearly shows that the language is not Python and the parser can drop this attempt early. + +Illegal symbols are defined as a a single regular expression: + +:: + + { + className: 'class', + illegal: '[${]' + } + + +Pre-defined modes and regular expressions +----------------------------------------- + +Many languages share common modes and regular expressions. Such expressions are defined in core highlight.js code +at the end under "Common regexps" and "Common modes" titles. Use them when possible. + + +Contributing +------------ + +Follow the :doc:`contributor checklist </language-contribution>`. |