zyj0021/tokenizer

Tokenizer - Data Extraction Library

Tokenizer is a .NET Standard and .NET Framework library that allows you to extract information from text using predefined patterns. Tokens embedded within patterns are extracted, validated and transformed before being returned as a strongly typed object:

var pattern = @"First Name: {FirstName}, Last Name: {LastName}, Enrolled: {Enrolled:ToDateTime('dd MMM yyyy')}";
var input = @"First Name: Alice, Last Name: Smith, Enrolled: 16 Jan 2018";

var student = new Tokenizer().Parse<Student>(pattern, input);

Assert.AreEqual("Alice", student.FirstName);
Assert.AreEqual("Smith", student.LastName);
Assert.AreEqual(new DateTime(2018, 1, 16), student.Enrolled);

Tokens work by matching the preceding text (preamble) in your input. When a match is found, the text after the preamble is taken and used to populate the token. Text is taken up to a terminator, or until the next token begins.
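
The preamble matching described above can be sketched in a few lines of C#. This is an illustration only, not Tokenizer's implementation: the `PreambleSketch` class, its `Extract` method, and the tuple shape are hypothetical names introduced here, and the sketch ignores terminators, optional tokens, validators, and transformers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class PreambleSketch
{
    // Each entry pairs a preamble (the literal text before a token) with the token name.
    public static Dictionary<string, string> Extract(
        (string Preamble, string Name)[] tokens, string input)
    {
        var values = new Dictionary<string, string>();
        var position = 0;

        for (var i = 0; i < tokens.Length; i++)
        {
            // Find this token's preamble at or after the current position.
            var start = input.IndexOf(tokens[i].Preamble, position, StringComparison.Ordinal);
            if (start < 0) break;
            start += tokens[i].Preamble.Length;

            // The value runs until the next token's preamble, or to the end of the input.
            var end = input.Length;
            if (i + 1 < tokens.Length)
            {
                var next = input.IndexOf(tokens[i + 1].Preamble, start, StringComparison.Ordinal);
                if (next >= 0) end = next;
            }

            values[tokens[i].Name] = input.Substring(start, end - start);
            position = end;
        }

        return values;
    }
}
```

Called with the preambles "First Name: " and ", Last Name: " against the input "First Name: Alice, Last Name: Smith", this sketch yields "Alice" for FirstName and "Smith" for LastName.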

In Order Processing

Tokens can be processed either in the order they appear in the input pattern, or in any order. If processing in order, a token can be marked as optional with the ? suffix to allow matching to continue if it is not present in the input.

var pattern = 
@"---
# Tokens must appear in defined order
OutOfOrder: false
---
First Name: {FirstName}
Middle Name: {MiddleName?}
Last Name: {LastName}";

var input = 
@"First Name: Alice
Last Name: Smith";

var student = new Tokenizer().Parse<Student>(pattern, input);

Assert.AreEqual("Alice", student.FirstName);
Assert.IsNull(student.MiddleName);
Assert.AreEqual("Smith", student.LastName);

Line Handling

Multiple tokens can appear on the same line of text, or tokens can span multiple lines of text if desired. Windows and Unix line endings are automatically handled in patterns and input.

var pattern = 
@"Comments:
{Comment:Trim()}

Name:
{Name}";

var input = 
@"Comments:
10/10
Would parse text again.

Name:
Bob";

var review = new Tokenizer().Parse<Review>(pattern, input);

Assert.AreEqual("10/10\nWould parse text again.", review.Comment);
Assert.AreEqual("Bob", review.Name);

New Line Termination

When a value sits on a single line, appending the $ symbol to the end of the token name matches up to the end of the current line:

var pattern = @"Name: {Name$}
Age: {Age:IsNumeric()}";

var input = @"Name: Bob
Surname: Jones
Age: 31";

var person = new Tokenizer().Parse<Person>(pattern, input);

Assert.AreEqual("Bob", person.Name);  // Not: "Bob\nSurname: Jones"
Assert.AreEqual(31, person.Age);
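
As a rough picture of what end-of-line termination means, the sketch below takes text from a start offset up to the first line break. `LineTerminationSketch` and `TakeToEndOfLine` are hypothetical names used only for illustration; how Tokenizer implements the $ suffix internally is not shown in this README.

```csharp
// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class LineTerminationSketch
{
    // Returns the text from `start` up to (not including) the first CR or LF,
    // or the rest of the input when no line break follows.
    public static string TakeToEndOfLine(string input, int start)
    {
        var end = input.IndexOfAny(new[] { '\r', '\n' }, start);
        return end < 0 ? input.Substring(start) : input.Substring(start, end - start);
    }
}
```

With the input above and a start offset just past "Name: ", the sketch returns "Bob" rather than running on into the Surname line.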

Repeating

Lists and repeating data elements can be extracted multiple times by appending the * suffix to the token. Tokenizer will populate an underlying List<> or IList<> on the target object.

var pattern = 
@"Name: {Manager.Name}
Employee: {Manager.Manages*}
Number: {Manager.Number}";

var input = 
@"Name: Sue
Employee: Alice
Employee: Bob
Employee: Charles
Number: 1234";

var result = new Tokenizer().Parse<Manager>(pattern, input);

Assert.AreEqual("Sue", result.Name);
Assert.AreEqual(3, result.Manages.Count);
Assert.AreEqual("Alice", result.Manages[0]);
Assert.AreEqual("Bob", result.Manages[1]);
Assert.AreEqual("Charles", result.Manages[2]);
Assert.AreEqual(1234, result.Number);

Repeating tokens are also treated as optional tokens.
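
The Manager type used in the example is not shown in this README. A minimal sketch of a target type that would fit the assertions above might look like the following; the property types are inferred from the asserts, and initialising the list up front is an assumption made here as a safe default, not a documented requirement.

```csharp
using System.Collections.Generic;

// Assumed shape of the target type; inferred from the example, not taken from the library.
public class Manager
{
    public string Name { get; set; }

    public int Number { get; set; }

    // Repeating token target: one entry per matched "Employee:" line.
    // Pre-initialised so the list is empty, not null, when nothing matches.
    public List<string> Manages { get; set; } = new List<string>();
}
```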

Configuration

Tokenizer configuration can be set globally, per instance, or per pattern.

// Global configuration
TokenizerOptions.Defaults.TrimTrailingWhiteSpace = false;

// Instance configuration
var tokenizer = new Tokenizer();
tokenizer.Options.TrimTrailingWhiteSpace = true;

// Front matter configuration
var pattern = @"---
# Trim Whitespace
TrimTrailingWhitespace: true
---
First Name: {FirstName}
Last Name: {LastName}
...";

Configuration Front Matter

Tokenizer templates are configurable via an embedded Front Matter section. The options set in the Front Matter affect the parsing of that template only, and override both global and instance settings.
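
The override order can be pictured as a simple fallback chain. The sketch below is hypothetical (`OptionPrecedenceSketch` and `Resolve` are not library names) and assumes that an instance setting, when present, takes priority over the global default; the README only states that Front Matter overrides both.

```csharp
// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class OptionPrecedenceSketch
{
    // Front Matter wins when set; otherwise the instance value (assumed to
    // outrank the global default); otherwise the global default.
    public static bool Resolve(bool globalDefault, bool? instanceValue, bool? frontMatterValue)
        => frontMatterValue ?? instanceValue ?? globalDefault;
}
```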

The Front Matter section is optional. It is processed between matching --- sequences at the start of the template pattern. Within the Front Matter, lines starting with the hash sign (#) are treated as comments.

---
# Treat missing properties on the target object as exceptions
ThrowExceptionOnMissingProperty: true

# Do a case insensitive compare when matching tokens to property names on the target
CaseSensitive: false
---
First Name: {FirstName}
Middle Names: {MiddleNames*}
Last Name: {LastName}

Configuration directives and their effects are listed in the Wiki.
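
To make the Front Matter rules concrete, here is a small sketch that splits a template into its directives and its pattern body. It is not Tokenizer's parser: `FrontMatterSketch` and `Split` are hypothetical names, directive values are kept as raw strings, and the handling of a missing closing --- is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class FrontMatterSketch
{
    public static (Dictionary<string, string> Options, string Body) Split(string pattern)
    {
        var options = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // Normalise Windows line endings so both styles are handled alike.
        var lines = pattern.Replace("\r\n", "\n").Split('\n');

        // Without an opening --- on the first line there is no Front Matter.
        if (lines[0].Trim() != "---") return (options, pattern);

        var i = 1;
        for (; i < lines.Length && lines[i].Trim() != "---"; i++)
        {
            var line = lines[i].Trim();
            if (line.Length == 0 || line[0] == '#') continue; // blank line or comment

            var colon = line.IndexOf(':');
            if (colon <= 0) continue; // not a "Key: value" directive
            options[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
        }

        // Assumption: with no closing ---, treat the whole text as the pattern body.
        if (i >= lines.Length) return (new Dictionary<string, string>(), pattern);

        // Everything after the closing --- is the pattern itself.
        var body = string.Join("\n", lines, i + 1, lines.Length - i - 1);
        return (options, body);
    }
}
```

Applied to the template above, the sketch returns the two directives (ThrowExceptionOnMissingProperty and CaseSensitive) and a body that starts at "First Name: {FirstName}".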

Data Transformations

Extracted data can be transformed before being set on the target object.

var pattern = "Name: {Name:Trim(),ToLower()}";
var input = "Name:      Alice      ";

var person = new Tokenizer().Parse<Person>(pattern, input);

Assert.AreEqual("alice", person.Name);

Multiple transformations (and validators) can be chained together using the , symbol and are executed in the order they are specified. You can implement and register your own token transformers by implementing the ITokenTransformer interface. See the Wiki for details and for a list of built-in transformers and their usage.
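
The effect of chaining in order can be shown with plain delegates. This sketch deliberately avoids the ITokenTransformer interface, whose members are documented in the Wiki and not here; `TransformChainSketch` and `Apply` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class TransformChainSketch
{
    // Applies each step to the output of the previous one, in the order given.
    public static string Apply(string value, IEnumerable<Func<string, string>> steps)
    {
        foreach (var step in steps)
        {
            value = step(value);
        }

        return value;
    }
}
```

Passing a trim step followed by a lower-casing step over "      Alice      " gives "alice", mirroring the Trim(),ToLower() chain in the example above.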

Data Validation

Token validation functions are run against extracted content before it's mapped to the target object. If a validation returns false, then the token is not mapped, and the input content is searched for another match.

var pattern = "Age: {Age:IsNumeric}";
var input = "Age: Ten, Age: 11";

var person = new Tokenizer().Parse<Person>(pattern, input);

Assert.AreEqual(11, person.Age);
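
The "reject and keep searching" behaviour can be sketched as a loop over successive preamble matches. This is an illustration under stated assumptions, not the library's code: `ValidationRetrySketch` and `TryFirstValid` are hypothetical names, and the explicit terminator parameter stands in for whatever Tokenizer uses to end a token.

```csharp
using System;

// Hypothetical helper, for illustration only. Not part of the Tokenizer API.
public static class ValidationRetrySketch
{
    // Finds the first value following `preamble` that passes `isValid`.
    public static bool TryFirstValid(
        string input, string preamble, string terminator,
        Func<string, bool> isValid, out string value)
    {
        if (string.IsNullOrEmpty(preamble))
            throw new ArgumentException("A non-empty preamble is required.", nameof(preamble));

        value = string.Empty;
        var position = 0;

        while (true)
        {
            var start = input.IndexOf(preamble, position, StringComparison.Ordinal);
            if (start < 0) return false; // no further candidates
            start += preamble.Length;

            // The candidate ends at the terminator, or at the end of the input.
            var end = input.IndexOf(terminator, start, StringComparison.Ordinal);
            if (end < 0) end = input.Length;

            var candidate = input.Substring(start, end - start);
            if (isValid(candidate))
            {
                value = candidate;
                return true;
            }

            position = end; // rejected: continue searching after this candidate
        }
    }
}
```

For the input "Age: Ten, Age: 11" with preamble "Age: ", terminator ",", and a numeric check, the first candidate "Ten" is rejected and the search continues to "11".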

You can implement and register your own token validators by implementing the ITokenValidator interface. See the Wiki for details and for a list of built-in validators and their usage.
