RNNSharp

RNNSharp is a toolkit for deep recurrent neural networks, widely used for many kinds of tasks such as sequence labeling. It is written in C# and requires .NET Framework 4.6 or above.

This page introduces what RNNSharp is, how it works, and how to use it. To get the demo package, please visit the release page and download it.

Overview

RNNSharp supports many types of deep recurrent neural network (aka DeepRNN) structures. For historical memory, it supports BPTT (Backpropagation Through Time) and LSTM (Long Short-Term Memory) structures. For the output layer, RNNSharp supports a native output layer and recurrent CRFs[1]. In addition, RNNSharp supports both forward RNN and bi-directional RNN structures.

For BPTT and LSTM: BPTT-RNN is usually called a "simple RNN", since the structure of its hidden layer nodes is very simple. It is not good at preserving long-term historical memory. LSTM-RNN is more complex, since its hidden layer nodes have an inner structure that helps preserve very long-term historical memory. In general, LSTM performs better than BPTT on longer sequences.

For native RNN output, many experiments and applications have shown that it achieves better results than traditional algorithms such as MEMM on online sequence labeling tasks, such as speech recognition and auto-suggestion.

For RNN-CRF, we compute the CRF output for the entire sequence based on the native RNN outputs and their transitions. Compared with a native RNN, RNN-CRF performs better on many offline sequence labeling tasks, such as word segmentation and named entity recognition. With a similar feature set, it also outperforms a linear CRF.

For a bi-directional RNN, the output combines the results of both a forward RNN and a backward RNN. It usually performs better than a single-directional RNN.

Here is an example of a deep bi-directional RNN-CRF network. It contains 3 hidden layers, 1 native RNN output layer and 1 CRF output layer.

Here is the inner structure of one bi-directional hidden layer.

Supported Feature Types

RNNSharp supports four types of features: template features, context template features, run-time features, and word embedding features. These features are controlled by the feature configuration file; the following sections explain what they are and how to use them.

Template Features

Template features are generated from templates. Given the templates and a corpus, the features are generated automatically. A template feature is a binary feature: if the feature exists for the current token, its value is 1; otherwise it is 0. This is similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool for generating this type of feature.

In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type; so far RNNSharp supports only U-type features, so the prefix is always "U". The id is used to distinguish different templates, and the rule-string is the feature body.

# Unigram
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]
U06:%x[-1,0]/%x[1,0]
U07:%x[-1,1]
U08:%x[0,1]
U09:%x[1,1]
U10:%x[-1,1]/%x[0,1]
U11:%x[0,1]/%x[1,1]
U12:%x[-1,1]/%x[1,1]
U13:C%x[-1,0]/%x[-1,1]
U14:C%x[0,0]/%x[0,1]
U15:C%x[1,0]/%x[1,1]

A rule-string is built from two kinds of elements: constant strings and macros. The simplest macro format is %x[row,col]. Row specifies the row offset between the current focusing token and the token the feature is generated from, and col specifies the absolute column position in the corpus. Combined macros are also supported, for example %x[row1,col1]/%x[row2,col2]. When the feature set is built, each macro is replaced by the specific string it refers to. Here is an example of training data:

Word Pos Tag
! PUN S
Tokyo NNP S_LOCATION
and CC S
New NNP B_LOCATION
York NNP E_LOCATION
are VBP S
major JJ S
financial JJ S
centers NNS S
. PUN S

! PUN S
p FW S
' PUN S
y NN S
h FW S
44 CD S
University NNP B_ORGANIZATION
of IN M_ORGANIZATION
Texas NNP M_ORGANIZATION
Austin NNP E_ORGANIZATION

According to the above templates, and assuming the current focusing token is “York NNP E_LOCATION”, the following features are generated:

U01:New
U02:York
U03:are
U04:New/York
U05:York/are
U06:New/are
U07:NNP
U08:NNP
U09:VBP
U10:NNP/NNP
U11:NNP/VBP
U12:NNP/VBP
U13:CNew/NNP
U14:CYork/NNP
U15:Care/VBP

Although the pairs U07/U08 and U11/U12 generate identical feature strings here, we can still distinguish them by their id strings.
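The macro expansion described above fits in a few lines. Here is an illustrative Python sketch (RNNSharp itself is written in C#; `expand_template` is a hypothetical helper, not part of RNNSharp's API, and boundary padding for out-of-range offsets is omitted for brevity):

```python
import re

def expand_template(template, sentence, pos):
    """Expand a rule-string such as 'U14:C%x[0,0]/%x[0,1]' for the token
    at index `pos` in `sentence` (a list of per-token column lists)."""
    def repl(m):
        row, col = int(m.group(1)), int(m.group(2))
        # row is an offset from the focusing token; col is an absolute column
        return sentence[pos + row][col]
    return re.sub(r"%x\[(-?\d+),(-?\d+)\]", repl, template)

sentence = [
    ["!", "PUN"], ["Tokyo", "NNP"], ["and", "CC"], ["New", "NNP"],
    ["York", "NNP"], ["are", "VBP"], ["major", "JJ"], ["financial", "JJ"],
    ["centers", "NNS"], [".", "PUN"],
]

# Focusing token is "York" (index 4)
print(expand_template("U04:%x[-1,0]/%x[0,0]", sentence, 4))  # U04:New/York
print(expand_template("U14:C%x[0,0]/%x[0,1]", sentence, 4))  # U14:CYork/NNP
```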

In the feature configuration file, the keyword TFEATURE_FILENAME specifies the file name of the template feature set in binary format.

Context Template Features

Context template features are template features combined with context. For example, if the setting is "-1,0,1", this feature combines the features of the current token with those of its previous and next tokens. For instance, if the sentence is "how are you", the generated feature set will be {Feature("how"), Feature("are"), Feature("you")}.

In the feature configuration file, the keyword TFEATURE_CONTEXT specifies the tokens' context range for this feature.
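To make the idea concrete, here is a hypothetical Python sketch of how a context setting such as "-1,0,1" expands into a combined feature set. The names and the offset-tagging scheme are illustrative assumptions, not RNNSharp's internal representation:

```python
def context_features(tokens, pos, offsets, extract):
    """For each offset in `offsets` (e.g. [-1, 0, 1]), extract features for
    the token at pos+offset and tag them with the offset, so the model can
    tell 'previous token said X' apart from 'current token said X'."""
    feats = []
    for off in offsets:
        i = pos + off
        if 0 <= i < len(tokens):          # offsets outside the sentence are skipped
            feats.extend(f"{off}:{f}" for f in extract(tokens[i]))
    return feats

tokens = ["how", "are", "you"]
print(context_features(tokens, 1, [-1, 0, 1], lambda w: [f"W={w}"]))
# → ['-1:W=how', '0:W=are', '1:W=you']
```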

Word Embedding Features

Word embedding features describe the features of a given token. They are very useful when we have only a small labeled corpus but lots of unlabeled corpus. This feature is generated by the Txt2Vec project: with a large unlabeled corpus, Txt2Vec can generate a vector for each token. Note that the token granularity of the word embedding features and the RNN training corpus should be consistent; otherwise, tokens in the training corpus cannot be matched with the feature. For more details about how to generate word embedding features, please visit the Txt2Vec homepage.

In RNNSharp, this feature also supports context: all features from the given context offsets are combined into a single word embedding feature.

The feature configuration has three related keywords: WORDEMBEDDING_FILENAME specifies the encoded word embedding data file generated by Txt2Vec, WORDEMBEDDING_CONTEXT specifies the token's context range, and WORDEMBEDDING_COLUMN specifies the column index in the corpus that the feature is applied to.
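A sketch of combining context embeddings by concatenation follows. This is pure illustration: the zero-padding choice for out-of-range or unknown tokens is an assumption made here, not necessarily what RNNSharp does internally:

```python
def embedding_context_feature(embeddings, tokens, pos, offsets, dim):
    """Concatenate the embedding vectors of the tokens at the given context
    offsets into one dense feature vector. Out-of-sentence or unknown
    tokens contribute a zero vector of size `dim` (an assumed padding)."""
    vec = []
    for off in offsets:
        i = pos + off
        if 0 <= i < len(tokens) and tokens[i] in embeddings:
            vec.extend(embeddings[tokens[i]])
        else:
            vec.extend([0.0] * dim)       # padding for missing tokens
    return vec

emb = {"New": [0.1, 0.2], "York": [0.3, 0.4], "are": [0.5, 0.6]}
v = embedding_context_feature(emb, ["New", "York", "are"], 1, [-1, 0, 1], 2)
print(v)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```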

Run Time Features

Unlike the other features, which are generated offline, this feature is generated at run time: it uses the predicted results of previous tokens as a run-time feature for the current token. This feature is only available for forward RNNs; bi-directional RNNs do not support it.

In the feature configuration, the keyword RTFEATURE_CONTEXT specifies the context range of this feature.
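Because the feature depends on earlier predictions, decoding must proceed left to right, which is why bi-directional RNNs cannot use it. The following Python sketch illustrates the feedback loop with a greedy decoder and a stand-in scoring function; RNNSharp's actual decoder differs:

```python
def decode_forward(tokens, score, rt_offsets=(-1,)):
    """Greedy forward decoding: tags already predicted for earlier tokens
    (at the given negative offsets) are fed back as a run-time feature for
    the current token. `score(token, rt)` stands in for the RNN's output
    layer; rt holds previous predictions (-1 means out of range)."""
    preds = []
    for i, tok in enumerate(tokens):
        rt = [preds[i + off] if 0 <= i + off < len(preds) else -1
              for off in rt_offsets]
        preds.append(score(tok, rt))
    return preds

# Toy scorer: tag "b" as 1 only when the previous prediction was 0
tags = decode_forward(["a", "b", "b"],
                      lambda tok, rt: 1 if tok == "b" and rt[0] == 0 else 0)
print(tags)  # [0, 1, 0]
```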

Feature Configuration File

The configuration file contains the settings for the different feature types introduced above. Here is an example of what the file looks like; in the console tool, use the -ftrfile parameter to specify the feature configuration file.

#The file name of template feature set
TFEATURE_FILENAME:tfeatures

#The context range of the template feature set. In the example below, the context is the current token, the next token and the token after next
TFEATURE_CONTEXT: 0,1,2

#The word embedding model generated by Txt2Vec. If the embedding model is in raw text format, use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as the keyword
WORDEMBEDDING_FILENAME:word_vector.bin

#The context range of the word embedding feature. In the example below, the context is the previous token, the current token and the next token
WORDEMBEDDING_CONTEXT: -1,0,1

#The column index for word embedding feature
WORDEMBEDDING_COLUMN: 0

#The context range of the run-time feature. In the example below, RNNSharp uses the output of the previous token as a run-time feature for the current token
RTFEATURE_CONTEXT: -1

Training file format

The training corpus contains many records describing what the model should learn. Each record contains one or more tokens, and each token has one or more feature dimensions describing it.

In the training file, each record is represented as a matrix and ends with an empty line. In the matrix, each row describes one token and its features, and each column represents one feature dimension. Across the entire training corpus, the number of columns must be fixed.

When RNNSharp encodes, if the column size is N, the first N-1 columns, as described by the template file, are used as input data for binary feature generation and model training. The Nth (last) column is the answer for the current token, which the model should output.
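A minimal Python sketch of this record layout, splitting a file into records and separating the input columns from the answer column (illustrative only, not RNNSharp code):

```python
def read_corpus(lines):
    """Split a training file into records (token matrices). Each record
    ends with an empty line; for every token row, the first N-1 columns
    are input features and the last column is the answer tag."""
    records, current = [], []
    for line in lines:
        cols = line.split()
        if not cols:                 # an empty line closes the current record
            if current:
                records.append(current)
                current = []
        else:
            current.append((cols[:-1], cols[-1]))  # (features, answer)
    if current:                      # flush a trailing record without a blank line
        records.append(current)
    return records

text = """New NNP B_LOCATION
York NNP E_LOCATION
are VBP S

! PUN S
"""
recs = read_corpus(text.splitlines())
print(len(recs))      # 2
print(recs[0][1])     # (['York', 'NNP'], 'E_LOCATION')
```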

Here is an example (a bigger training file is available in the release section, where you can view and download it):

Word Pos Tag
! PUN S
Tokyo NNP S_LOCATION
and CC S
New NNP B_LOCATION
York NNP E_LOCATION
are VBP S
major JJ S
financial JJ S
centers NNS S
. PUN S

! PUN S
p FW S
' PUN S
y NN S
h FW S
44 CD S
University NNP B_ORGANIZATION
of IN M_ORGANIZATION
Texas NNP M_ORGANIZATION
Austin NNP E_ORGANIZATION

In the above example, the output answer uses the "POS_TYPE" format: POS is the position of the term within the chunk or named entity, and TYPE is the output type of the term.

The example is for labeling named entities in records. It has two records, and each token has three columns. The first column is the term of the token, the second column is the token's corresponding pos-tag, and the third column is the token's named entity type. The first and second columns are input data for model training, and the third column is the ideal model output, i.e. the answer.

For POS, it supports four types as follows:
S : the chunk has only one term
B : the first term of the chunk
M : one of the middle terms in the chunk
E : the last term of the chunk

For TYPE, the example contains some types as follows:
ORGANIZATION : the name of one organization
LOCATION : the name of one location
An output answer without a TYPE marks a normal term that is not part of a named entity.
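The POS/TYPE scheme above can be decoded back into entity spans. Here is an illustrative Python sketch (a hypothetical helper, not RNNSharp's API):

```python
def tags_to_entities(tags):
    """Convert POS_TYPE answers (S/B/M/E position prefixes) into
    (start, end, type) entity spans. Tags without a TYPE suffix are
    ordinary terms and yield no entity."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        pos, _, etype = tag.partition("_")
        if not etype:                # plain tag such as "S": not an entity
            start = None
            continue
        if pos == "S":               # single-term entity
            entities.append((i, i, etype))
        elif pos == "B":             # entity begins here
            start = i
        elif pos == "E" and start is not None:
            entities.append((start, i, etype))
            start = None
    return entities

tags = ["S", "S_LOCATION", "S", "B_LOCATION", "E_LOCATION", "S"]
print(tags_to_entities(tags))  # [(1, 1, 'LOCATION'), (3, 4, 'LOCATION')]
```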

Test file format

The test file has a similar format to the training file; the only difference is the last column. In the test file, all columns are features for model decoding.

Tag Mapping File

This file contains the available result tags of the model. For readability, RNNSharp uses tag names in the corpus; however, for efficient encoding and decoding, tag names are mapped to integer ids. The mapping is defined in a file (passed with the -tagfile parameter in the console tool), with one tag name per line.
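A minimal sketch of such a mapping, assuming one tag name per line as described (illustrative Python, not RNNSharp's implementation; whether ids follow line order is an assumption here):

```python
def load_tag_map(lines):
    """Map each tag name (one per line) to an integer id, so the model can
    work with ids internally while the corpus stays human-readable."""
    names = [ln.strip() for ln in lines if ln.strip()]
    name2id = {name: i for i, name in enumerate(names)}
    return name2id, names            # names doubles as the id -> name table

name2id, id2name = load_tag_map(["S", "B_LOCATION", "E_LOCATION"])
print(name2id["E_LOCATION"])  # 2
print(id2name[0])             # S
```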

Console Tool

RNNSharpConsole

RNNSharpConsole.exe is a console tool for recurrent neural network encoding and decoding. The tool has two running modes: "train" mode is for model training, and "test" mode predicts output tags for a test corpus using a given encoded model.

Encode Model

In this mode, the console tool encodes an RNN model from a given feature set and training/validation corpus. Usage is as follows:

RNNSharpConsole.exe -mode train
Parameters for training RNN based model
-trainfile : training corpus file
-validfile : validated corpus for training
-modelfile : encoded model file
-modeltype : model structure: simple RNN is 0, LSTM-RNN is 1, default is 0
-ftrfile : feature configuration file
-tagfile : supported output tagid-name list file
-alpha : learning rate, default is 0.1
-dropout : hidden layer node drop out ratio, default is 0
-bptt : the step for back-propagation through time, default is 4
-layersize : the size of each hidden layer, default is 200 for a single layer. For more than one layer, separate the layer sizes with ','. For example, "-layersize 200,100" means the network has two hidden layers: the first of size 200 and the second of size 100
-crf <0/1>: training model by standard RNN(0) or RNN-CRF(1), default is 0
-maxiter : maximum iterations for training. 0 means no limit; default is 20
-savestep : save a temporary model after training every given number of sentences (e.g. 200K); default is 0 (disabled)
-dir : RNN directional: 0 - Forward RNN, 1 - Bi-directional RNN, default is 0
-vq : Model vector quantization, 0 is disable, 1 is enable. default is 0

Example: RNNSharpConsole.exe -mode train -trainfile train.txt -validfile valid.txt -modelfile model.bin -tagfile tags.txt -layersize 200,100 -modeltype 0 -alpha 0.1 -bptt 4 -crf 1 -maxiter 20 -savestep 200K -dir 1

The above command line trains a bi-directional recurrent neural network with CRF output. The network has two BPTT hidden layers and one output layer; the first hidden layer size is 200 and the second is 100.

Decode Model

In this mode, the console tool predicts the output tags of a given corpus. Usage is as follows:

RNNSharpConsole.exe -mode test
Parameters for predicting iTagId tag from given corpus
-testfile : test corpus file
-modelfile : encoded model file
-tagfile : supported output tagid-name list file
-ftrfile : feature configuration file
-outfile : result output file

Example: RNNSharpConsole.exe -mode test -testfile test.txt -modelfile model.bin -tagfile tags.txt -ftrfile features.txt -outfile result.txt

TFeatureBin

TFeatureBin.exe is used to generate the template feature set from given template and corpus files. For high-performance access and low memory cost, the indexed feature set is built as a double-array trie by AdvUtils. The tool supports three modes, as follows:

TFeatureBin.exe
The tool is to generate template feature from corpus and index them into file
-mode : support extract,index and build modes
extract : extract features from corpus and save them as raw text feature list
index : build indexed feature set from raw text feature list
build : extract features from corpus and generate indexed feature set

Build mode

This mode extracts features from the given corpus according to the templates and then builds the indexed feature set. Usage is as follows:

TFeatureBin.exe -mode build
This mode is to extract feature from corpus and generate indexed feature set
-template : feature template file
-inputfile : file used to generate features
-ftrfile : generated indexed feature file
-minfreq : min-frequency of feature

Example: TFeatureBin.exe -mode build -template template.txt -inputfile train.txt -ftrfile tfeature -minfreq 3

In the above example, the feature set is extracted from train.txt and built into the tfeature file as an indexed feature set.

Extract mode

This mode only extracts features from the given corpus and saves them into a raw text file. The difference between build mode and extract mode is that extract mode saves the feature set in raw text format rather than indexed binary format. Usage is as follows:

TFeatureBin.exe -mode extract
This mode is to extract features from corpus and save them as text feature list
-template : feature template file
-inputfile : file used to generate features
-ftrfile : generated feature list file in raw text format
-minfreq : min-frequency of feature

Example: TFeatureBin.exe -mode extract -template template.txt -inputfile train.txt -ftrfile features.txt -minfreq 3

In the above example, the feature set is extracted from train.txt according to the templates and saved into features.txt in raw text format. The format of the output file is "feature string \t frequency in corpus". Here are a few example lines:

U01:仲恺 \t 123
U01:仲文 \t 10
U01:仲秋 \t 12

"U01:仲恺" is the feature string, and 123 is the frequency of this feature in the corpus.
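The extract-and-filter step can be sketched as follows (illustrative Python; RNNSharp implements this in TFeatureBin.exe, and the function name here is hypothetical):

```python
from collections import Counter

def extract_features(feature_lists, minfreq):
    """Count every generated feature string across the corpus, keep only
    those occurring at least `minfreq` times, and emit the
    'feature <TAB> frequency' lines that extract mode writes out."""
    counts = Counter(f for feats in feature_lists for f in feats)
    return [f"{feat}\t{n}" for feat, n in sorted(counts.items()) if n >= minfreq]

# Per-token feature lists generated by the templates over a toy corpus
corpus_features = [["U02:York", "U07:NNP"], ["U07:NNP"], ["U07:NNP"]]
print(extract_features(corpus_features, 2))  # ['U07:NNP\t3']
```

Filtering by -minfreq drops rare features, which shrinks the indexed feature set and reduces overfitting on features seen only once or twice.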

Index mode

This mode only builds the indexed feature set from given templates and a feature set in raw text format. Usage is as follows:

TFeatureBin.exe -mode index
This mode is to build indexed feature set from raw text feature list
-template : feature template file
-inputfile : feature list in raw text format
-ftrfile : indexed feature set

Example: TFeatureBin.exe -mode index -template template.txt -inputfile features.txt -ftrfile features.bin

In the above example, according to the templates, the raw text feature set features.txt will be indexed into the binary file features.bin.

Performance

Here are performance results on a Chinese named entity recognition task. You can get the corpus, configuration and parameter files from the RNNSharp demo package in the release section. The results are based on a bi-directional BPTT-RNN model; the first hidden layer size is 200, and the second hidden layer size is 100. The results below are on the test corpus.

| Parameter | Token Error | Sentence Error |
| --- | --- | --- |
| 1-hidden layer | 5.53% | 15.46% |
| 1-hidden layer-CRF | 5.51% | 13.60% |
| 2-hidden layers | 5.47% | 14.23% |
| 2-hidden layers-CRF | 5.40% | 12.93% |

Run on Linux/Mac

With the Mono project, a third-party .NET framework implementation for Linux/Mac, RNNSharp can run on non-Windows platforms such as Linux and Mac without recompilation or modification.

APIs

RNNSharp also provides APIs for developers to integrate it into their own projects. Download the source code package and open the RNNSharpConsole project to see how to use the APIs to encode and decode RNN models. Note that before using the RNNSharp APIs, you should add RNNSharp.dll as a reference in your project.

Reference

  1. [Recurrent Conditional Random Field For Language Understanding](http://research.microsoft.com/pubs/210167/rcrf_v9.pdf)
  2. [Recurrent Neural Networks for Language Understanding](http://research.microsoft.com/pubs/200236/RNN4LU.pdf)
  3. [RNNLM - Recurrent Neural Network Language Modeling Toolkit](http://research.microsoft.com/pubs/175562/ASRU-Demo-2011.pdf)
