Automa v1.0 Beta
Theory
Theory of regular expressions
I know how to use regular expressions, but I didn't know the underpinning theory.
- the atoms of RegExp:
  - the empty string: `re""` is a valid RegExp
  - a single-symbol match (from a finite alphabet): `re"p"`
- atoms can be combined by: alternation `|`, concatenation `*`, repetition `R*`
- other operations can be constructed from the primitives above: `R?` is `""|R`, `R+` is `RR*`
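As a sketch in Automa's own operator syntax (using the `rep` function form of repetition described later):

```julia
using Automa

r = re"A"
r_opt  = re"" | r    # ""|R : how R? is derived from alternation
r_plus = r * rep(r)  # RR*  : how R+ is derived from concatenation and repetition
```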
Julia's PCRE library (and many other popular regex libraries) supports operations like backreferences and lookbehind, which cannot be constructed from the atomic operations above (Automa does not support them); that extra power is why those engines are slower than theoretically sound RegExps.
Automa converts the RegExp to a finite automaton, which is then implemented in code.
Finite automata
- NFA: non-deterministic finite automaton
  - has ε edges that can move from node A to node B at no symbol cost
  - can have multiple transitions from a state
  - can have multiple final states
- DFA: deterministic finite automaton
  - has no ε edges
  - can have at most one transition from a state per input symbol
  - can only have one final state
  - easier to simulate in code, and faster
  - can be optimised (by collapsing nodes)
- converting an NFA with N nodes to a DFA can result in \(2^N\) nodes in the worst case; there are two main ways to deal with this (see the sketch after this list):
  - powerset construction, as in Automa.jl: construct the DFA directly and accept the \(O(2^N)\) worst case (the worst case is rare, and the construction happens at compile time)
  - adaptive construction, as in the ripgrep CLI (where the construction happens at runtime): construct the DFA on the fly and cache it; if the DFA grows too large, flush the cache, and if the cache is flushed too often, fall back to simulating the NFA
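For a feel of the worst case, a classic pathological pattern (my own illustration, not from the Automa docs): a regex that must remember whether each of the last n+1 symbols was an `a` forces the powerset construction to produce on the order of \(2^n\) states:

```julia
using Automa

# (a|b)*a(a|b){n}: the DFA must track the positions of 'a' among the last n+1 symbols.
n = 10
ab = re"a" | re"b"
pathological = rep(ab) * re"a" * reduce(*, fill(ab, n))
machine = compile(pathological)  # machine size (and compile time) grows exponentially in n
```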
Automa in a nutshell
- Automa creates a Julia Expr that simulates the DFA, then uses metaprogramming to generate a Julia function, which is in turn optimised by Julia and LLVM
- Julia code can be embedded into the DFA simulation; currently, only two kinds of code are supported (see the sketch after this list): actions and preconditions
  - actions: Julia code executed during state transitions
  - preconditions: Julia code that returns a `Bool`, checked before a state transition
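A minimal sketch of both kinds, with placeholder action/precondition names of my own (`onenter!`, `onexit!` and `precond!` attach them to a regex):

```julia
using Automa

regex = re"[a-z]+"
onenter!(regex, :mark_pos)   # action: Julia code run when the DFA enters this sub-pattern
onexit!(regex, :emit_word)   # action: Julia code run when it leaves it
precond!(regex, :allowed)    # precondition: the transition is only taken if `allowed` is true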
Regex
- use the `@re_str` macro: `re"ABC[DEF]"`
- matches individual bytes, not characters: `re"Æ" == re"\xc3\x86"`, i.e. the concatenation of two bytes
- supports:
  - literal symbols: `re"ABC"`, `re"\xfe\xa2"`, `re"Θ"`
  - the metacharacters `|`, `[]`, `[^]`, `&`, `*`, `+`, `?`, `()` (see the examples after this list)
- operations on RegExps:
  - `*` for concatenation: `re"ABC" * re"DEF"` is `re"ABCDEF"`
  - `|` for alternation: `re"ABC" | re"DEF"` is `re"ABC|DEF"`
  - `&` for intersection: `re"A[AB]C+D?" & re"ABC"` is `re"ABC"`
  - `\` for difference: `A \ B` means match A but not B
  - `!` for inversion: `!re"ABC"` means match anything but `ABC`
  - function forms: `opt` == `?`, `rep` == `*`, `rep1` == `+`; `opt(re"a" * rep(re"b") * re"c")` is `re"(ab*c)?"`
- `Automa.RegExp.RE`: the regex type in Automa (it can be constructed from an `AbstractString`)
Validators
Text validators
Used for testing only, because Julia's built-in PCRE regex engine is more suitable for text matching.
Buffer validator
```julia
generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)
```
Use `generate_buffer_validator` to generate a function that matches a given regex against a buffer:
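A minimal sketch with a hypothetical function name and regex (on a match the generated function returns `nothing`, else the index of the first errant byte, or `0` on unexpected EOF, as I read the docstring):

```julia
using Automa

eval(generate_buffer_validator(:is_lowercase_word, re"[a-z]+"))

is_lowercase_word("abc")  # -> nothing (the buffer matches)
is_lowercase_word("aBc")  # -> 2 (index of the first errant byte)
```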
IO validators
```julia
generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)
```
Use `generate_io_validator` to generate a function that matches a given regex against an IO stream, for large files that cannot be read into a buffer. Load TranscodingStreams before Automa to use the IO validator.
- if the input matches, return `nothing`
- else, return `(byte, (line, column))`:
  - `byte`: the first errant byte
  - `line`: the line number of the errant byte
  - `column`: the column number of the errant byte; it is `0` if the errant byte is a newline `\n`
- if the error is `EOF`, `byte` is `nothing`, and `line` and `column` point to the last byte
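A usage sketch with a hypothetical file and regex:

```julia
using TranscodingStreams  # load before Automa to enable the IO validator
using Automa

eval(generate_io_validator(:validate_lines, rep(re"[a-z]*\n")))

open("data.txt") do io
    result = validate_lines(io)
    if result === nothing
        println("stream matches")
    else
        byte, (line, column) = result
        println("errant byte $byte at line $line, column $column")
    end
end
```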
Machines
`Machine` is a type that represents an optimised DFA; we can convert a regex to a machine (regex -> NFA -> DFA -> Machine) through `compile()`:
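For example:

```julia
using Automa

machine = compile(re"ab*c")  # regex -> NFA -> DFA -> optimised Machine
# code generators (validators, tokenizers, Automa.generate_code) all work from a Machine
```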
Tokenizers (lexers)
A tokenizer (lexer) breaks an input text into smaller chunks and classifies each chunk as one of several tokens; the process is called tokenization or lexing.
Making and using a tokenizer
```julia
make_tokenizer(
    tokens::Union{
        AbstractVector{RE},
        Tuple{E, AbstractVector{Pair{E, RE}}}
    };
    goto::Bool=true,
    version::Int=1,
    unambiguous=false
) where E
```
Evaluating the code returned by `make_tokenizer` defines a function `tokenize`. If `tokens` is an abstract vector, it creates an iterator of integer tokens, with the error token being zero and the non-error tokens being their index in the vector. Else, `tokens` is the error token followed by `token => regex` pairs. See the relevant other methods of `make_tokenizer`, and `compile`.
Example:
julia> make_tokenizer([re"abc", re"def") |> eval
julia> collect(tokenize(Int, "abcxyzdef123"))
4-element Vector{Tuple{Int64, Int32, UInt32}}:
(1, 3, 0x00000001)
(4, 3, 0x00000003)
(7, 3, 0x00000002)
(10, 3, 0x00000003)
Using enums as tokens
```julia
tokens = [
    :lparens => re"\(",
    :rparens => re"\)",
    :comma => re",",
    :quot => re"\"",
    :space => re" +",
    :letters => re"[a-zA-Z]+"
]
@eval @enum Token error $(first.(tokens)...)
make_tokenizer((error,
    [Token(i) => j for (i,j) in enumerate(last.(tokens))]
)) |> eval
```
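A possible way to use the enum tokenizer defined above (the input string is my own example; each yielded element is `(start, length, token)`):

```julia
for (start, len, token) in tokenize(Token, "(foo,\"bar\")")
    println((start, len, token))
end
# expected classification, in order:
# lparens, letters, comma, quot, letters, quot, rparens
```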
Token disambiguation
make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval
In the above case, the input `ab` matches all three regexes. Two rules decide which token is assigned (see the sketch after this list):
- as long as possible (maximal munch): `ab` is emitted as one 2-byte token rather than two 1-byte tokens
- higher-index tokens are preferred: among the regexes matching that longest chunk, the highest index wins, so `ab` is assigned to `re"ab"` (index 3), and `a` is assigned to `re"ab*"` (index 2) rather than `re"[ab]+"` (index 1)
- use `unambiguous=true` to turn off these priority rules, so ambiguous tokens cause an error instead
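Assuming the tokenizer above has been evaluated, a sketch of what the two rules imply (expected outputs derived by hand from the rules):

```julia
collect(tokenize(UInt32, "ab"))  # one 2-byte token; all three match, index 3 wins: [(1, 2, 0x00000003)]
collect(tokenize(UInt32, "a"))   # re"[ab]+" and re"ab*" match; index 2 wins:       [(1, 1, 0x00000002)]
```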