The tokenizer is configurable: you can modify the TokenizerSettings.CharTypes[] array to specify the type of each character, and change other settings such as whether to look for comments.
WARNING: This is not internationalized. All characters beyond the 7-bit ASCII range (above decimal 127) are treated as Word characters.
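As a rough illustration of the table-driven classification described above, here is a minimal, self-contained sketch. It does not use the tokenizer's actual API: CharKind, CharTypeSketch, SetKind and Classify are hypothetical names, and the default assignments (control characters and space as whitespace, ASCII letters as Word) are assumptions made for this example. The only behavior taken from the text is that any character above decimal 127 is reported as a Word character.

```csharp
using System;

// Hypothetical stand-in for a character type table; not the library's own types.
public enum CharKind : byte { Other = 0, Whitespace, Word, Digit }

public static class CharTypeSketch
{
    // One entry per 7-bit ASCII code point (0..127).
    private static readonly CharKind[] Table = BuildDefaultTable();

    private static CharKind[] BuildDefaultTable()
    {
        var table = new CharKind[128];                 // entries default to CharKind.Other
        for (int c = 0; c <= ' '; c++) table[c] = CharKind.Whitespace;  // assumption
        for (int c = 'a'; c <= 'z'; c++) table[c] = CharKind.Word;
        for (int c = 'A'; c <= 'Z'; c++) table[c] = CharKind.Word;
        for (int c = '0'; c <= '9'; c++) table[c] = CharKind.Digit;
        return table;
    }

    // Reconfigure a single ASCII character, similar in spirit to editing CharTypes[].
    public static void SetKind(char c, CharKind kind)
    {
        if (c > 127)
            throw new ArgumentOutOfRangeException(nameof(c), "Only 7-bit ASCII is configurable.");
        Table[c] = kind;
    }

    // Anything outside the 7-bit range is treated as a Word character.
    public static CharKind Classify(char c)
    {
        return c > 127 ? CharKind.Word : Table[c];
    }

    public static void Main()
    {
        Console.WriteLine(Classify('x'));        // Word
        Console.WriteLine(Classify('_'));        // Other
        SetKind('_', CharKind.Word);             // make underscore part of words
        Console.WriteLine(Classify('_'));        // Word
        Console.WriteLine(Classify('\u00E9'));   // Word (non-ASCII, not configurable)
    }
}
```

The fallback in Classify is why accented letters, CJK characters, and non-ASCII punctuation all end up inside Word tokens in this sketch.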
There are two main ways to use this: 1) parse the entire stream at once and get a List of Tokens (see the Tokenize* methods), or 2) call NextToken() repeatedly. The tokenizer reads from a TextReader, which you can set directly, and it also provides convenience methods for parsing files and strings. It produces an Eof token when the end of the input is reached.
Here's an example of the NextToken() style of use:
    StreamTokenizer tokenizer = new StreamTokenizer();
    tokenizer.GrabWhitespace = true;
    tokenizer.Verbosity = VerbosityLevel.Debug; // just for debugging
    tokenizer.TextReader = File.OpenText(fileName);
    Token token;
    while (tokenizer.NextToken(out token))
        log.Info("Token = '{0}'", token);
Here's an example of the Tokenize... style of use:
    StreamTokenizer tokenizer = new StreamTokenizer("some string");
    List<Token> tokens = new List<Token>();
    if (!tokenizer.Tokenize(tokens))
    {
        // error handling
    }
    foreach (Token t in tokens)
        Console.WriteLine("t = {0}", t);
Comment delimiters are hardcoded (// and /*) and are not affected by the character type table.
The tokenizer sets line numbers on the tokens it produces. A token's line number is normally the line on which the token starts. There is one known caveat: when the GrabWhitespace setting is true and a whitespace token contains a newline, that token's line number is set to the following line rather than the line on which the token started.
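To make the caveat concrete, the following self-contained sketch shows one way such a line-number shift can arise. It is an assumption for illustration only, not the tokenizer's actual implementation: LineNumberSketch and Scan are hypothetical names, and the scanner simply records the line counter after each run of characters has been consumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mini-scanner; it only models the line-number behavior, nothing else.
public static class LineNumberSketch
{
    // Splits the input into alternating whitespace and non-whitespace runs and
    // pairs each run with the line counter as it stands AFTER the run was read.
    public static List<(string Text, int Line)> Scan(string input)
    {
        var tokens = new List<(string Text, int Line)>();
        int line = 1;
        int i = 0;
        while (i < input.Length)
        {
            bool isWhitespace = char.IsWhiteSpace(input[i]);
            int start = i;
            while (i < input.Length && char.IsWhiteSpace(input[i]) == isWhitespace)
            {
                if (input[i] == '\n') line++;   // the counter advances inside the run
                i++;
            }
            tokens.Add((input.Substring(start, i - start), line));
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (var (text, line) in Scan("a b\n c"))
            Console.WriteLine("line {0}: '{1}'", line, text.Replace("\n", "\\n"));
        // line 1: 'a'
        // line 1: ' '
        // line 1: 'b'
        // line 2: '\n '   <- starts on line 1 but is reported on line 2
        // line 2: 'c'
    }
}
```

Non-whitespace runs never contain a newline, so only whitespace tokens are affected. A caller that needs the starting line of a whitespace token has to account for the newline it contains.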