Tokenizr
Flexible String Tokenization Library for JavaScript
About
Tokenizr is a small JavaScript library providing powerful and flexible string tokenization functionality. It is intended to be used as the underlying "lexical scanner" in a Recursive Descent based "syntax parser", but can be used for other parsing purposes, too. Its distinct features are:
- Efficient Iteration: iterates over the input string in a read-only fashion
- Stacked States: tokenization rules can be bound to particular states, managed on a state stack
- Regular Expression Matching: rules match the input with regular expressions
- Match Repeating: rule actions can change the state and repeat the matching from scratch
- Match Rejecting: rule actions can reject their matching and let subsequent rules match instead
- Match Ignoring: rule actions can ignore their matching (e.g. for whitespace and comments)
- Match Accepting: rule actions can accept their matching with one or more resulting tokens
- Shared Context Data: rule actions can store and retrieve arbitrary values in their action context to share data between rules
- Token Text and Value: tokens carry both their matched text and an optional pre-processed value
- Debug Mode: the tokenization process can be traced through detailed logging
- Nestable Transactions: token consuming can be wrapped into nestable transactions
- Token Look-Ahead: the token stream can be inspected without consuming tokens
Installation
Node environments (with NPM package manager):
$ npm install tokenizr
Browser environments (with Bower package manager):
$ bower install tokenizr
Usage
Suppose we have a configuration file sample.cfg:

foo {
    baz = 1 // sample comment
    bar {
        quux = 42
        hello = "hello \"world\"!"
    }
    quux = 7
}
Then we can write a lexical scanner in ECMAScript 6 (under Node.js) for the tokens like this:
import fs from "fs"
import Tokenizr from "tokenizr"

let lexer = new Tokenizr()

lexer.rule(/[a-zA-Z_][a-zA-Z0-9_]*/, (ctx, match) => {   // identifiers
    ctx.accept("id")
})
lexer.rule(/[+-]?[0-9]+/, (ctx, match) => {              // numbers (with pre-parsed value)
    ctx.accept("number", parseInt(match[0]))
})
lexer.rule(/"((?:\\"|[^\r\n])*)"/, (ctx, match) => {     // double-quoted strings (with unescaped value)
    ctx.accept("string", match[1].replace(/\\"/g, "\""))
})
lexer.rule(/\/\/[^\r\n]*\r?\n/, (ctx, match) => {        // line comments (ignored)
    ctx.ignore()
})
lexer.rule(/[ \t\r\n]+/, (ctx, match) => {               // whitespace (ignored)
    ctx.ignore()
})
lexer.rule(/./, (ctx, match) => {                        // everything else, as single characters
    ctx.accept("char")
})

let cfg = fs.readFileSync("sample.cfg", "utf8")
lexer.input(cfg)
lexer.debug(true)
lexer.tokens().forEach((token) => {
    console.log(token.toString())
})
The output of running this sample program is:
<type: id, value: "foo", text: "foo", pos: 0, line: 1, column: 1>
<type: char, value: "{", text: "{", pos: 4, line: 1, column: 5>
<type: id, value: "baz", text: "baz", pos: 10, line: 2, column: 5>
<type: char, value: "=", text: "=", pos: 14, line: 2, column: 9>
<type: number, value: 1, text: "1", pos: 16, line: 2, column: 11>
<type: id, value: "bar", text: "bar", pos: 40, line: 3, column: 5>
<type: char, value: "{", text: "{", pos: 44, line: 3, column: 9>
<type: id, value: "quux", text: "quux", pos: 54, line: 4, column: 9>
<type: char, value: "=", text: "=", pos: 59, line: 4, column: 14>
<type: number, value: 42, text: "42", pos: 61, line: 4, column: 16>
<type: id, value: "hello", text: "hello", pos: 72, line: 5, column: 9>
<type: char, value: "=", text: "=", pos: 78, line: 5, column: 15>
<type: string, value: "hello \"world\"!", text: "\"hello \\\"world\\\"!\"", pos: 80, line: 5, column: 17>
<type: char, value: "}", text: "}", pos: 103, line: 6, column: 5>
<type: id, value: "quux", text: "quux", pos: 109, line: 7, column: 5>
<type: char, value: "=", text: "=", pos: 114, line: 7, column: 10>
<type: number, value: 7, text: "7", pos: 116, line: 7, column: 12>
<type: char, value: "}", text: "}", pos: 118, line: 8, column: 1>
<type: EOF, value: "", text: "", pos: 122, line: 9, column: 1>
If you want to combine multiple single-char plaintext tokens into a multi-char plaintext token, you can use the following code fragment:
let plaintext = ""
lexer.before((ctx, match, rule) => {
    if (rule.name !== "plaintext" && plaintext !== "") {
        ctx.accept("plaintext", plaintext)
        plaintext = ""
    }
})
lexer.rule(/./, (ctx, match) => {
    plaintext += match[0]
    ctx.ignore()
}, "plaintext")
lexer.finish((ctx) => {
    if (plaintext !== "")
        ctx.accept("plaintext", plaintext)
})
With the additional help of an Abstract Syntax Tree (AST) library like ASTy and a query library like ASTq you can write powerful Recursive Descent based parsers which parse such a token stream into an AST and then query and process the AST.
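For illustration, the following fragment is a minimal sketch (not part of Tokenizr, ASTy or ASTq; the parseConfig, parseBlock and parseValue helper names are made up) of how such a Recursive Descent parser could directly consume the token stream of the sample.cfg scanner above, using Tokenizr#consume(), Tokenizr#peek() and Tokenizr#alternatives():

lexer.input(cfg)                           // re-feed the input, as lexer.tokens() above consumed it
const parseValue = () => {
    return lexer.alternatives(             // try alternatives in order; "this" is the Tokenizr instance
        function () { return this.consume("number").value },
        function () { return this.consume("string").value }
    )
}
const parseBlock = () => {
    const block = {}
    lexer.consume("char", "{")
    while (!lexer.peek().isA("char", "}")) {
        const key = lexer.consume("id").value
        if (lexer.peek().isA("char", "=")) {
            lexer.consume("char", "=")
            block[key] = parseValue()      // "key = value" entry
        }
        else
            block[key] = parseBlock()      // nested "key { ... }" block
    }
    lexer.consume("char", "}")
    return block
}
const parseConfig = () => {
    const config = {}
    const key = lexer.consume("id").value  // top-level "foo { ... }"
    config[key] = parseBlock()
    return config
}
console.log(parseConfig())

A real parser would typically construct AST nodes (e.g. with ASTy) in these functions instead of plain objects.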
Application Programming Interface (API)
Class Tokenizr
This is the main API class for establishing a lexical scanner.
- Constructor: Tokenizr(): Tokenizr
  Create a new tokenization instance.
- Method: Tokenizr#reset(): Tokenizr
  Reset the tokenization instance to a fresh one by discarding all internal state information.
- Method: Tokenizr#debug(enable: Boolean): Tokenizr
  Enable (or disable) debug mode: the internal processing of the tokenization is logged in detail.
- Method: Tokenizr#input(input: String): Tokenizr
  Set the input string to tokenize. This implicitly performs a reset() operation beforehand.
- Method: Tokenizr#push(state: String): Tokenizr
  Push a state onto the state stack.
- Method: Tokenizr#pop(): String
  Pop a state from the state stack and return it. The initial state (named default) cannot be popped.
- Method: Tokenizr#state(state: String): Tokenizr
- Method: Tokenizr#state(): String
  Set or get the state on the top of the state stack. Use this to initially start tokenizing with a custom state (see the sketch below). The initial state is named default.
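For illustration, a small sketch of starting tokenization in a custom state (the state name "expression" and the variable src are made up here; rules for that state would have to be configured with Tokenizr#rule()):

lexer.input(src)           // set the input (implicitly resets, state becomes "default")
lexer.state("expression")  // start tokenizing in the custom "expression" state instead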
- Method: Tokenizr#tag(tag: String): Tokenizr
  Set a tag. Tags can additionally restrict which rules match (see Tokenizr#rule() below).
- Method: Tokenizr#tagged(tag: String): Boolean
  Check whether a particular tag is set.
- Method: Tokenizr#untag(tag: String): Tokenizr
  Unset a particular tag. The tag no longer has to be matched by rules.
- Method: Tokenizr#before(action: (ctx: ActionContext, match: Array[String], rule: { state: String, pattern: RegExp, action: Function, name: String }) => Void): Tokenizr
  Configure a single action which is called directly before any rule action (configured with Tokenizr#rule()) is called. This can be used to execute a common action just before all rule actions. The rule argument is the Tokenizr#rule() information of the particular rule which is executed.
- Method: Tokenizr#after(action: (ctx: ActionContext, match: Array[String], rule: { state: String, pattern: RegExp, action: Function, name: String }) => Void): Tokenizr
  Configure a single action which is called directly after any rule action (configured with Tokenizr#rule()) is called. This can be used to execute a common action just after all rule actions. The rule argument is the Tokenizr#rule() information of the particular rule which is executed.
- Method: Tokenizr#finish(action: (ctx: ActionContext) => Void): Tokenizr
  Configure a single action which is called directly before the final EOF token is emitted. This can be used to execute a common action just after the last rule action was called.
- Method: Tokenizr#rule(state?: String, pattern: RegExp, action: (ctx: ActionContext, match: Array[String]) => Void, name?: String): Tokenizr
  Configure a token matching rule named name, which executes its action in case the current tokenization state is one of the states (and all of the currently set tags) in state (by default the rule matches all states if state is not specified) and the next input characters match against the pattern. The exact syntax of state is <state>[ #<tag> #<tag> ...][, <state>[ #<tag> #<tag> ...], ...], i.e., it is one or more comma-separated state matches (OR-combined) and each state match has exactly one state and zero or more space-separated tags (AND-combined). The ctx argument provides a context object for token repeating/rejecting/ignoring/accepting, and the match argument is the result of the underlying RegExp#exec call. State- and tag-scoped rules are sketched below.
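For illustration, the following fragment is a small standalone sketch of state- and tag-scoped rules (the state name "comment" and the tag name "verbose" are made up): block comments are scanned in a dedicated state, and one additional rule is only active while the "verbose" tag is set.

const lexer2 = new Tokenizr()                           // a fresh scanner, independent of the one above
lexer2.rule("default", /\/\*/, (ctx, match) => {        // enter block comment state
    ctx.push("comment")
    ctx.ignore()
})
lexer2.rule("comment", /\*\//, (ctx, match) => {        // leave block comment state
    ctx.pop()
    ctx.ignore()
})
lexer2.rule("comment", /[\s\S]/, (ctx, match) => {      // skip any comment content (incl. newlines)
    ctx.ignore()
})
lexer2.rule("default #verbose", /;/, (ctx, match) => {  // only matched while tag "verbose" is set
    ctx.accept("separator")
})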
- Method: Tokenizr#token(): Tokenizr.Token
  Get the next token from the input string (or null once no more tokens are available).
- Method: Tokenizr#tokens(): Array[Tokenizr.Token]
  Tokenize the entire input and return all corresponding tokens. This is a convenience method on top of Tokenizr#token().
- Method: Tokenizr#skip(next?: Number): Tokenizr
  Get and discard the next number of following tokens with Tokenizr#token().
- Method: Tokenizr#consume(type: String, value?: String): Tokenizr.Token
  Match (with Tokenizr.Token#isA) the next token. If it matches type and optionally also value, consume it. If it does not match, throw a Tokenizr.ParsingError. This is the primary function used in Recursive Descent parsers.
- Method: Tokenizr#peek(offset?: Number): Tokenizr.Token
  Peek at a following token (at the given offset) without consuming it.
- Method: Tokenizr#begin(): Tokenizr
  Begin a transaction. Until either Tokenizr#commit() or Tokenizr#rollback() are called, all consumed tokens will be internally remembered and be either thrown away (on Tokenizr#commit()) or pushed back (on Tokenizr#rollback()). This can be used multiple times and this way supports nested transactions. It is intended to be used for tokenizing alternatives.
- Method: Tokenizr#depth(): Number
  Return the number of tokens already consumed in the currently active transaction.
- Method: Tokenizr#commit(): Tokenizr
  End the currently active transaction successfully: the consumed tokens are finally gone.
- Method: Tokenizr#rollback(): Tokenizr
  End the currently active transaction unsuccessfully: the consumed tokens are pushed back and can be consumed again.
- Method: Tokenizr#alternatives(...alternatives: Array[() => any]): any
  Utility method for parsing alternatives: the supplied callback functions are executed in sequence, each within its own transaction, and the result of the first one which succeeds (does not throw an exception) is returned. The this in each callback function points to the Tokenizr object on which alternatives was called. A manually handled transaction is sketched below.
- Method: Tokenizr#error(message: String): Tokenizr.ParsingError
  Return a new instance of Tokenizr.ParsingError, based on the current input character stream position, and with Tokenizr.ParsingError#message set to message.
Class Tokenizr.Token
This is the class of all returned tokens.
- Property: Tokenizr.Token#type: String
  The type of the token as specified on Tokenizr.ActionContext#accept().
- Property: Tokenizr.Token#value: any
  The value of the token. By default this is the same as Tokenizr.Token#text, but can be any pre-processed value as specified on Tokenizr.ActionContext#accept().
- Property: Tokenizr.Token#text: String
  The matched input text of the token.
- Property: Tokenizr.Token#pos: Number
  The (0-based) character position of the token in the input string.
- Property: Tokenizr.Token#line: Number
  The (1-based) line number of the token in the input string.
- Property: Tokenizr.Token#column: Number
  The (1-based) column number of the token in the input string.
- Method: Tokenizr.Token#toString(colorize?: (type: String, value: String) => String): String
  Return a formatted string representation of the token. The optional colorize callback can be used to colorize the output.
- Method: Tokenizr.Token#isA(type: String, value?: any): Boolean
  Check whether the token is of a particular type and optionally has a particular value. This is especially used internally by Tokenizr#consume().
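For illustration, a small sketch of manual look-ahead with Tokenizr#peek() and Tokenizr.Token#isA() (the "-" sign token is just an assumed example):

if (lexer.peek().isA("char", "-")) // look ahead without consuming
    lexer.consume("char", "-")     // consume the optional sign token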
Class Tokenizr.ParsingError
This is the class of all thrown exceptions related to parsing.
- Property: Tokenizr.ParsingError#name: String
  The name of the error, always just ParsingError, to be compliant with the JavaScript Error class specification.
- Property: Tokenizr.ParsingError#message: String
  The particular error message.
- Property: Tokenizr.ParsingError#pos: Number
  The character position in the input string where the error occurred.
- Property: Tokenizr.ParsingError#line: Number
  The line number in the input string where the error occurred.
- Property: Tokenizr.ParsingError#column: Number
  The column number in the input string where the error occurred.
- Property: Tokenizr.ParsingError#input: String
  The input string which was tokenized.
- Method: Tokenizr.ParsingError#toString(): String
  Return a formatted string representation of the error.
Class Tokenizr.ActionContext
This is the class of all rule action contexts.
- Method: Tokenizr.ActionContext#data(key: String, value?: any): any
  Store or retrieve any user data (indexed by key) to the action context for sharing data between two or more rules, as sketched below.
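For illustration, a small sketch of sharing data between rules via the action context (the data key "ids" is made up): one rule counts identifiers, while other rules could later read the count.

lexer.rule(/[a-zA-Z_][a-zA-Z0-9_]*/, (ctx, match) => {
    ctx.data("ids", (ctx.data("ids") || 0) + 1) // increment the shared counter
    ctx.accept("id")
})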
- Method: Tokenizr.ActionContext#info(): { line: number, column: number, pos: number, len: number }
  Retrieve information about the current matching: the line, column and character position (pos) of the match in the input string and the length (len) of the match.
- Method: Tokenizr.ActionContext#push(state: String): Tokenizr
- Method: Tokenizr.ActionContext#pop(): String
- Method: Tokenizr.ActionContext#state(state: String): Tokenizr.ActionContext
- Method: Tokenizr.ActionContext#state(): String
- Method: Tokenizr.ActionContext#tag(tag: String): Tokenizr.ActionContext
- Method: Tokenizr.ActionContext#tagged(tag: String): Boolean
- Method: Tokenizr.ActionContext#untag(tag: String): Tokenizr.ActionContext
  Methods just passed through to the attached Tokenizr object. See above for details.
- Method: Tokenizr.ActionContext#repeat(): Tokenizr.ActionContext
  Mark the current matching to be repeated from scratch at the current input position. You usually have to change the tokenization state beforehand with Tokenizr.ActionContext#state() or this will lead to an endless loop, of course! A sketch follows below.
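For illustration, a small sketch of repeating after a state change (the state names "value" and "key" are made up): when an identifier is seen while in "value" state, switch back to "key" state and let that state's rules match it instead.

lexer.rule("value", /[a-zA-Z_][a-zA-Z0-9_]*/, (ctx, match) => {
    ctx.state("key") // change the state first...
    ctx.repeat()     // ...then re-match the same input with the "key" rules
})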
- Method: Tokenizr.ActionContext#reject(): Tokenizr.ActionContext
  Mark the current matching to be rejected. The tokenization continues by trying the subsequently configured rules.
- Method: Tokenizr.ActionContext#ignore(): Tokenizr.ActionContext
  Mark the current matching to be just ignored, i.e., to produce no token at all (as usually done for whitespace and comments).
- Method: Tokenizr.ActionContext#accept(type: String, value?: any): Tokenizr.ActionContext
  Mark the current matching to be accepted, producing a token of type type and optionally with a different value (usually a pre-processed variant of the matched text). This function can be called multiple times to produce one or more distinct tokens in sequence, as sketched below.
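For illustration, a small sketch of producing two tokens from a single match (the token types "key" and "assign" are made up): a matched "name=" sequence is split into two distinct tokens.

lexer.rule(/([a-zA-Z_][a-zA-Z0-9_]*)=/, (ctx, match) => {
    ctx.accept("key", match[1])  // first token: the name
    ctx.accept("assign", "=")    // second token: the assignment operator
})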
- Method: Tokenizr.ActionContext#stop(): Tokenizr.ActionContext
  Mark the tokenization process to be stopped: the Tokenizr#token() method immediately starts to return null.
RegExp Flag Support
The pattern passed to Tokenizr.{before,after,rule}() has to be a regular JavaScript RegExp object. Internally, Tokenizr creates a copy of this object by skipping its g (global) and y (sticky) flags and taking over its m (multiline), s (dotAll), i (ignoreCase), and u (unicode) flags.
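For example, a case-insensitive keyword rule can simply carry the i flag on its pattern, which Tokenizr takes over into its internal copy (the "keyword" token type is just an assumed example):

lexer.rule(/foo|bar/i, (ctx, match) => {
    ctx.accept("keyword", match[0].toLowerCase()) // normalize the matched keyword
})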
Implementation Notice
Although Tokenizr is written in ECMAScript 6, it is transpiled to ECMAScript 5 and this way runs in really all(!) current (as of 2015) JavaScript environments, of course.
Internally, Tokenizr scans the input string in a read-only fashion by leveraging RegExp's g flag (global, for ECMAScript <= 5 environments) or y flag (sticky, for ECMAScript >= 2015 environments) in combination with RegExp's lastIndex field.
Alternatives
The following alternatives are known:
- moo: A medium-powerful tokenizer/lexer. It provides nearly the same functionality as Tokenizr.
- lex
License
Copyright (c) 2015-2024 Dr. Ralf S. Engelschall (http://engelschall.com/)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.