Skip to content

deniskyashif/thesis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Lexical Analysis using a Bimachine

Lexer Generator API

[Implementation]

Usage

Generate a lexer by providing a set of lexical rules.

var lexer = Lexer.Create(new[]
{
    new Rule("[0-9]+\\.?[0-9]*", "NUM"),
    new Rule("[+*/-]", "OP"),
    new Rule("=", "EQ"),
    new Rule("[ \t\r\n]", "WS")
});

Set the input either as a string

lexer.Input = new InputStream("3.14+1.86=5");

or as a stream.

using var reader = new StreamReader("path/to/file.txt");
lexer.Input = new InputStream(reader);

Perform the tokenization.

foreach (Token token in lexer.GetNextToken())
    Console.WriteLine(token);
3.14+1.86=5
[@0,0:3='3.14',<NUM>]
[@1,4:4='+',<OP>]
[@2,5:8='1.86',<NUM>]
[@3,9:9='=',<EQ>]
[@4,10:10='5',<NUM>]

Lexer.GetNextToken() returns an iterator which yields a sequence of tokens one at a time. The Token class contains the following properties:

public class Token
{
    public int Index { get; set; }
    public (int Start, int End) Position { get; set; }
    public string Text { get; set; }
    public string Type { get; set; }

    public override string ToString() =>
        $"[@{Index},{Position.Start}:{Position.End}='{Text}',<{Type}>]";
}

Import/Export

Depending on the grammar, the lexer construction might take some time. Hence, once built, it can be exported as a binary file and then loaded in a project.

var lexer = Lexer.Create(grammar);
lexer.ExportToFile("lexout");
var lexer = Lexer.LoadFromFile("lexout");

Example Lexers

[Tests]

  • Arithmetic expression
  • JSON
  • Regular expression
  • Tokenizer for the English language

Building Blocks

Regular Expression Parser

[Implementation] [Tests]

Supported syntax

Expression Example
Concatenation ab
Union a|b
Zero-or-more a*
One-or-more a+
Optional a?
Grouping (a|b)*c
Char class [a-z0-9,]
Negative char class [^a-z0-9,]
Match exactly 2 'a's a{2}
Match at least 2 'a's a{2,}
Match between 2 and 4 'a's a{2,4}
Escaping (matches "a*") a\*
Escaping in char classes (match 'a', '-' or 'z') [a\-z]

Finite-state device construction and operations

Text Rewriters based on Regular Relations

[Implementations] [Tests]

  • Optional rewrite transducer
  • Obligatory rewrite transducer
  • Leftmost-longest match rewrite transducer

Misc

Example usages of the APIs can be found in the unit tests' project.

GraphViz

The finite state devices Fsa, Dfsa and Fst can be exported as GraphViz.

var fst = new RegExp("a+").Automaton
  .Product(new RegExp("b").Automaton)
  .ToRealTime()
  .Transducer
  .PseudoMinimal();

Console.WriteLine(fst.ToGraphViz());
digraph G {
    rankdir=LR; size="8,5"
    node [shape=circle]
    -1 [label= "", shape=none,height=.0,width=.0];
    -1 -> 0;
    0 -> 1 [label="a/b"]
    0 -> 0 [label="a/ε"]
    1 -> 1 [label="a/ε"]
    1 [shape = doublecircle]
}

References

About

Lexical Analysis using a Bimachine

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published