Automa v1.0 is a breaking release; see the new documentation.


Theory of regular expressions

I know how to use regular expressions, but I didn't know the underlying theory.

  • the atoms of RegExp: the empty string re"" is a valid RegExp; a single-symbol match (from a finite alphabet), e.g. re"p"

  • atoms can be combined: alternation R|S, concatenation R * S, repetition R*

  • other operations can be constructed from the primitives above:

    • R? is ""|R, R+ is RR*

  • Julia's PCRE lib (and many popular regex libs) supports operations like backreferences and lookbehind that can NOT be constructed from the atomic operations above (Automa does not support them); this extra power makes such engines slower than theoretically sound RegExps

  • Automa converts a RegExp to a finite automaton, which is then implemented in code

Finite automata

  • NFA: non-deterministic finite automata

    • has ε edges that move from node A to node B without consuming a symbol

    • can have multiple transitions from a state

    • can have multiple final states

  • DFA: deterministic finite automata

    • has no ε edges

    • has at most one transition per state for each input symbol

    • can have one or more final states

    • easier and faster to simulate in code

    • can be optimised (by collapsing equivalent nodes)

    • converting an NFA with N nodes to a DFA can result in \(2^N\) nodes in the worst case; there are two main ways to deal with this:

      • eager powerset construction, as in Automa.jl: construct the full DFA up front and accept the \(O(2^N)\) worst case (because the worst case is rare, and the construction happens at compile time)

      • lazy/adaptive construction, as in the ripgrep CLI (where construction happens at runtime): the DFA is constructed on the fly and cached; if the DFA grows too large the cache is flushed, and if the cache is flushed too often it falls back to simulating the NFA
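The subset-construction idea can be sketched in plain Julia: the "DFA state" is the set of NFA states reachable so far, and ε-closures absorb the ε-edges. This is an illustrative toy (the ToyNFA type, state numbering, and matches function are my own), not Automa's actual implementation:

```julia
# Toy subset-construction simulation: the DFA state is the set of NFA
# states reachable so far (this on-the-fly variant is essentially the
# "lazy" strategy described above, minus the caching).

struct ToyNFA
    trans::Dict{Tuple{Int,Char},Set{Int}}  # labeled edges: (state, symbol) -> states
    eps::Dict{Int,Set{Int}}                # ε-edges: state -> states
    start::Int
    finals::Set{Int}
end

# All states reachable from `states` using only ε-edges.
function eps_closure(nfa::ToyNFA, states::Set{Int})
    closure = copy(states)
    stack = collect(states)
    while !isempty(stack)
        s = pop!(stack)
        for t in get(nfa.eps, s, Set{Int}())
            if t ∉ closure
                push!(closure, t)
                push!(stack, t)
            end
        end
    end
    return closure
end

# Simulate the DFA that powerset construction would build, one symbol at a time.
function matches(nfa::ToyNFA, input::AbstractString)
    state = eps_closure(nfa, Set([nfa.start]))
    for c in input
        next = Set{Int}()
        for s in state
            union!(next, get(nfa.trans, (s, c), Set{Int}()))
        end
        state = eps_closure(nfa, next)
        isempty(state) && return false  # dead state: no match possible
    end
    return !isempty(intersect(state, nfa.finals))
end

# Hand-built NFA for the language of re"ab*c":
# 1 --a--> 2 --ε--> 3,  3 --b--> 3,  3 --ε--> 4,  4 --c--> 5
nfa = ToyNFA(
    Dict((1, 'a') => Set([2]), (3, 'b') => Set([3]), (4, 'c') => Set([5])),
    Dict(2 => Set([3]), 3 => Set([4])),
    1,
    Set([5]),
)
```

Here matches(nfa, "ac") and matches(nfa, "abbc") are true, while matches(nfa, "ab") is false because the run halts in a non-final state.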

Automa in a nutshell

  • Automa creates a Julia Expr that simulates the DFA, uses metaprogramming to generate a Julia function from it, and lets Julia and LLVM optimise that function

  • Julia code can be embedded into the DFA simulation; currently only two kinds of code are supported: actions and preconditions

  • actions: Julia code executed during state transitions

  • preconditions: Julia code that returns a Bool, checked before a state transition is taken


  • use @re_str macro: re"ABC[DEF]"

  • matches individual bytes, not characters: re"Æ" == re"\xc3\x86", i.e. the concatenation of two bytes

  • supports:

    • Literal symbols: re"ABC", re"\xfe\xa2", re"Θ"

    • |, [], [^], &, *, +, ?, ()

  • operations for Regex:

    • * for concatenation: re"ABC" * re"DEF" is re"ABCDEF"

    • | for alternation: re"ABC" | re"DEF" is re"ABC|DEF"

    • & for intersection: re"A[AB]C+D?" & re"ABC" is re"ABC"

    • \ for difference: A \ B means match A but not B

    • ! for inversion: !re"ABC" means match anything but ABC

    • opt == ?; rep == *; rep1 == +: opt(re"a" * rep(re"b") * re"c") == re"(ab*c)?"

  • Automa.RegExp.RE: the regex type in Automa, constructed from an AbstractString (e.g. via the @re_str macro)


Text validators

Used mostly for testing, because Julia's built-in PCRE regex engine is more suitable for general text matching.


fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = re">" * header * '\n' * rep1(seqline * '\n')
    record
end


Buffer validator

generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)

Use the function generate_buffer_validator to generate a function that matches a given regex against a buffer:


# generate a function named validate_fasta:
eval(generate_buffer_validator(:validate_fasta, fasta_regex));
# use it:
validate_fasta(">hello\nTAGAGA\nTAGAG") # missing trailing newline, return 0
validate_fasta(">helloXXX") # Error at byte index 7, return 7
validate_fasta(">hello\nTAGAGA\nTAGAG\n") # it matches, return nothing
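Buffer validators also give a quick sanity check of the operator aliases listed earlier (opt == ?, rep == *): the function-style and re-string spellings of the same pattern should accept exactly the same inputs. A sketch of my own (the names valid_fn and valid_str are arbitrary; assumes Automa is loaded):

```julia
using Automa

# Two spellings of the same language: function style vs. re-string style.
pattern_fn  = opt(re"a" * rep(re"b") * re"c")  # (ab*c)?
pattern_str = re"(ab*c)?"

# Generate and define one validator per spelling.
eval(generate_buffer_validator(:valid_fn, pattern_fn))
eval(generate_buffer_validator(:valid_str, pattern_str))
```

Both validators should return nothing for inputs such as "abbc", "ac", and "" (the pattern is optional), and a byte index for anything else.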


IO validators

generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)

Use the function generate_io_validator to generate a function that matches a regex against an IO stream, for large files that cannot be read into a buffer. Load TranscodingStreams before Automa to use the IO validator.

  • if the input matches, return nothing

  • else, return (byte, (line, column)):

    • byte: the first errant byte

    • line: the line number of the errant byte

    • column: the column number of the errant byte; it is 0 if the errant byte is a newline \n

    • if the error is an unexpected EOF, byte is nothing, and (line, column) points to the last byte of the input
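A minimal usage sketch (my own; it reuses the fasta_regex pattern from the buffer-validator section, and the name validate_fasta_io is an arbitrary choice):

```julia
using TranscodingStreams  # load before using Automa's IO validators
using Automa

# Same FASTA pattern as in the buffer-validator example.
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    re">" * header * '\n' * rep1(seqline * '\n')
end

# Generate and define the IO validator function at top level.
eval(generate_io_validator(:validate_fasta_io, fasta_regex))
```

The generated function can then be applied to any IO, e.g. an open file handle or an IOBuffer; a full match returns nothing, anything else returns the (byte, (line, column)) tuple described above.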


Machine is a type that represents an optimised DFA; we can convert a regex to a machine (regex -> NFA -> DFA -> Machine) through compile():


compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine
    # `optimize`: whether to optimise the DFA
    # `unambiguous`: whether to error when the machine's actions are ambiguous;
    #   if false, the ambiguity is resolved by taking the longest token; if true, it errors.

    # example:
    machine = let
        header = re"[a-z]+"
        seqline = re"[ACGT]+"
        record = re">" * header * '\n' * rep1(seqline * '\n')
        compile(record)
    end
    @assert machine isa Automa.Machine


Tokenizers (lexers)

A tokenizer (lexer) breaks an input text into smaller chunks and classifies each as one of several token kinds; this process is called tokenization or lexing.

Making and using a tokenizer


Tokenizer{E, D, C}

Lazy iterator of tokens of type E over data of type D.

Tokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a 3-tuple of integers:

1. the 1-based starting index of the token in the buffer

2. the length of the token in bytes

3. the token kind: the index in the input list of tokens

Un-tokenizable data will be emitted as the "error token" with index zero. The Int parameter C allows multiple tokenizers to be created with otherwise the same type parameters.


tokenize(::Type{E}, data, version=1)

Create a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.


make_tokenizer(
    machine::TokenizerMachine;
    tokens::Tuple{E, AbstractVector{E}} = [ integers ],
    goto=true, version=1
) where E

Create code which, when evaluated, defines Base.iterate(::Tokenizer{E, D, $version}). tokens is a tuple of the error token (which will be emitted for data that cannot be tokenized) and a vector of non-error tokens of length machine.n_tokens.



julia> machine = compile([re"a", re"b"]);

julia> make_tokenizer(machine; tokens=(0x00, [0x01, 0x02])) |> eval

julia> iter = tokenize(UInt8, "abxxxba"); typeof(iter)
Tokenizer{UInt8, String, 1}

julia> collect(iter)
5-element Vector{Tuple{Int64, Int32, UInt8}}:
 (1, 1, 0x01)
 (2, 1, 0x02)
 (3, 3, 0x00)
 (6, 1, 0x02)
 (7, 1, 0x01)


make_tokenizer(
    tokens::Union{
        AbstractVector{RE},
        Tuple{E, AbstractVector{Pair{E, RE}}}
    };
    goto=true, version=1
) where E

Convenience function that compiles a tokenizer and then runs make_tokenizer on it. If tokens is an abstract vector, create an iterator of integer tokens with the error token being zero and the non-error tokens being their index in the vector. Otherwise, tokens is the error token followed by token => regex pairs. See the relevant other methods of make_tokenizer, and compile.



julia> make_tokenizer([re"abc", re"def"]) |> eval

julia> collect(tokenize(UInt32, "abcxyzdef123"))
4-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 3, 0x00000001)
 (4, 3, 0x00000000)
 (7, 3, 0x00000002)
 (10, 3, 0x00000000)


Using enums as tokens


tokens = [
    :lparens => re"\(",
    :rparens => re"\)",
    :comma => re",",
    :quot => re"\"",
    :space => re" +",
    :letters => re"[a-zA-Z]+"
]
@eval @enum Token error $(first.(tokens)...)
make_tokenizer((error,
    [Token(i) => j for (i,j) in enumerate(last.(tokens))]
)) |> eval


Token disambiguation


make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval


In the above case, the input ab matches all three regexes. There are 2 rules to assign tokens (plus an escape hatch):

  1. as long as possible: ab is emitted as one 2-byte token rather than two 1-byte tokens;

  2. high-index tokens are preferred: so a is assigned to re"ab*" rather than re"[ab]+", since re"ab*" has the higher index (2);

  3. use unambiguous=true to turn off the priority rules, so ambiguous tokens cause an error instead.
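A quick way to see rule 2 in action (my own sketch; assumes Automa is loaded and that the integer-token convenience defaults to UInt32 tokens, as in the earlier example):

```julia
using Automa

# Three overlapping token regexes, as above.
make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval

# "a" matches tokens 1 and 2 at length 1; the higher index (2) wins.
token = only(collect(tokenize(UInt32, "a")))
```

Here token should be the tuple (1, 1, 0x00000002): starting index 1, length 1, token kind 2 (re"ab*").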

TBC ....