soheilstar-z/PersianVerbAnalyzer

 
 

In the name of Allah

This package contains a rule-based Persian verb inflector developed by Mohammad Sadegh Rasooli. The code was mainly used to preprocess the Persian dependency treebank.

Note!

How to use the code

The code is compatible with C# 3.5 or later.

There are two ways to get a verb-analyzed sentence:

  1. Without part-of-speech tags (no disambiguation; every word is treated as a potential verb). In SentenceAnalyzer.cs:

     public static VerbBasedSentence MakeVerbBasedSentence(string sentence)
    

    or

     public static VerbBasedSentence MakeVerbBasedSentence(string[] sentence)
    
  2. With part-of-speech and morphosyntactic tags (with good accuracy). The POS tags are the same as those of the Bijankhan corpus:

     public static VerbBasedSentence MakeVerbBasedSentence(string[] sentence, string[] posSentence, string[] lemmas, MorphoSyntacticFeatures[] morphoSyntacticFeatureses)
    

Sample Code

The program.cs file contains a test that analyzes a Persian sentence and writes the result to a file. It can be used as a starting point.

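// Requires: using System.IO; using System.Text;
// The constructor takes the path to the verb dictionary file.
var analyzer = new SentenceAnalyzer("../../../Data/VerbList.txt");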
var sentence = "من دارم به شما می‌گویم که این صحبت‌ها به راحتی گفته نخواهد شد و من با شما صحبت زیاد خواهم کرد.";
var result = SentenceAnalyzer.MakeVerbBasedSentence(sentence);
var output = new StringBuilder();
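// One tab-separated line per token: word form, lemma, CPOS tag, head, dependency relation.
// HeadNumber + 1 gives the 1-based head position seen in the output below.
foreach (var dependencyBasedToken in result.SentenceTokens)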
{
    output.AppendLine(dependencyBasedToken.WordForm + "\t" + dependencyBasedToken.Lemma + "\t" +
                      dependencyBasedToken.CPOSTag
                      + "\t" + (dependencyBasedToken.HeadNumber+1).ToString() + "\t" +
                      dependencyBasedToken.DependencyRelation);
}
File.WriteAllText("../../../testOutPut.txt",output.ToString());

Output in testOutPut.txt (tab-separated columns: word form, lemma, CPOS tag, head number, dependency relation):

من	_	_	0	_
دارم	داشت#دار	V	5	PROG
به	_	_	0	_
شما	_	_	0	_
می‌گویم	گفت#گو	V	0	_
که	_	_	0	_
این	_	_	0	_
صحبت‌ها	_	_	0	_
به	_	_	0	_
راحتی	_	_	0	_
گفته نخواهد شد	گفت#گو	V	0	_
و	_	_	0	_
من	_	_	0	_
با	_	_	0	_
شما	_	_	0	_
صحبت	_	_	18	NVE
زیاد	_	_	0	_
خواهم کرد	کرد#کن	V	0	_
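
To consume this output elsewhere, the five tab-separated columns can be read back into simple objects. The following is only a sketch: `AnalyzedToken` and `OutputReader` are hypothetical names that are not part of the package, and splitting the lemma on `#` into a past and a present root is an assumption based on the lemmas shown above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical holder for one output line; not a type from the package.
public class AnalyzedToken
{
    public string WordForm;
    public string Lemma;      // "_" when the token is not a verb
    public string CPosTag;    // "V" for verbs in the sample output
    public int Head;          // value of the fourth column, as written in the file
    public string Relation;   // e.g. PROG, NVE, or "_"
}

public static class OutputReader
{
    // Parses lines in the five-column, tab-separated layout shown above.
    public static List<AnalyzedToken> Parse(IEnumerable<string> lines)
    {
        var tokens = new List<AnalyzedToken>();
        foreach (string raw in lines)
        {
            string line = raw.TrimEnd('\r', '\n');
            if (line.Length == 0) continue; // skip blank lines

            string[] cols = line.Split('\t');
            if (cols.Length != 5)
                throw new FormatException("Expected 5 columns, got " + cols.Length);

            var token = new AnalyzedToken();
            token.WordForm = cols[0];
            token.Lemma = cols[1];
            token.CPosTag = cols[2];
            token.Head = int.Parse(cols[3]);
            token.Relation = cols[4];
            tokens.Add(token);
        }
        return tokens;
    }

    public static void Main()
    {
        // Three lines copied from the sample output above.
        string[] lines =
        {
            "من\t_\t_\t0\t_",
            "دارم\tداشت#دار\tV\t5\tPROG",
            "می‌گویم\tگفت#گو\tV\t0\t_"
        };

        foreach (AnalyzedToken token in Parse(lines))
        {
            if (token.CPosTag != "V") continue;
            // Assumption: the lemma has the form pastRoot#presentRoot.
            string[] roots = token.Lemma.Split('#');
            Console.WriteLine(token.WordForm + ": past=" + roots[0] + ", present=" + roots[1]);
        }
    }
}
```

In a real run, the lines would come from `File.ReadAllLines` on testOutPut.txt instead of the inline array.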

Verb Dictionary Format

The file is tab-separated with the following fields:

  • verbType: integer

    1: simple, 2: prefix verb, 3: compound verb, 4: compound prefix verb, 5: prepositional compound prefix verb, 6: enclitic verb, 7: prepositional verb

  • transitivity: integer

    0: intransitive, 1: transitive, 2: bitransitive

  • past tense root: string

    "-" if not present

  • present tense root: string

    "-" if not present

  • Non-verbal element: string

    "-" if not present

  • Prefix: string

    "-" if not present

  • Preposition: string

    "-" if not present

  • amrShodani: string

    "-": true, "*": false

  • vowelEnd: string

    Final vowel of the present tense root: U: ends with u, I: ends with ei, A: ends with a, ?: otherwise

  • maziVowel: string

    Initial vowel type of the past tense root: A: starts with "a" or "æ", @: otherwise

  • mozarehVowel: string

    Initial vowel type of the present tense root: bU: starts with "bu", ba: starts with "bæ", bA: starts with "ba", A: starts with "a" or "æ", !: otherwise
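
As a rough illustration of the format, one dictionary line could be read as follows. This is a sketch under assumptions, not the package's own loader: `VerbEntry` is a hypothetical type, the eleven fields are assumed to appear in the order listed above, and the sample line in `Main` is made up for illustration rather than copied from VerbList.txt.

```csharp
using System;

// Hypothetical container for one dictionary line; not a type from the package.
public class VerbEntry
{
    public int VerbType;            // 1..7, see verbType above
    public int Transitivity;        // 0, 1 or 2
    public string PastRoot;         // null when the field is "-"
    public string PresentRoot;      // null when the field is "-"
    public string NonVerbalElement; // null when the field is "-"
    public string Prefix;           // null when the field is "-"
    public string Preposition;      // null when the field is "-"
    public string AmrShodani;       // raw marker, kept exactly as written
    public string VowelEnd;         // U, I, A or ?
    public string MaziVowel;        // A or @
    public string MozarehVowel;     // bU, ba, bA, A or !

    // Assumes the fields are stored in the order they are listed above.
    public static VerbEntry Parse(string line)
    {
        string[] f = line.TrimEnd('\r', '\n').Split('\t');
        if (f.Length < 11)
            throw new FormatException("Expected 11 tab-separated fields, got " + f.Length);

        var entry = new VerbEntry();
        entry.VerbType = int.Parse(f[0]);
        entry.Transitivity = int.Parse(f[1]);
        entry.PastRoot = OrNull(f[2]);
        entry.PresentRoot = OrNull(f[3]);
        entry.NonVerbalElement = OrNull(f[4]);
        entry.Prefix = OrNull(f[5]);
        entry.Preposition = OrNull(f[6]);
        entry.AmrShodani = f[7];
        entry.VowelEnd = f[8];
        entry.MaziVowel = f[9];
        entry.MozarehVowel = f[10];
        return entry;
    }

    // "-" marks an absent value in the string fields.
    private static string OrNull(string field)
    {
        return field == "-" ? null : field;
    }
}

public static class VerbEntryDemo
{
    public static void Main()
    {
        // Illustrative line only: the roots match the lemma "گفت#گو" from the
        // sample output; the flag values are made up, not taken from VerbList.txt.
        string line = string.Join("\t",
            new string[] { "1", "1", "گفت", "گو", "-", "-", "-", "-", "U", "@", "!" });

        VerbEntry entry = VerbEntry.Parse(line);
        Console.WriteLine(entry.VerbType + " " + entry.PastRoot + " " + entry.PresentRoot);
        Console.WriteLine(entry.Prefix == null ? "no prefix" : entry.Prefix);
    }
}
```

The amrShodani marker is kept as a raw string here so that its interpretation stays with the caller.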

Some Points

I assume the character set has already been refined when you pass array arguments to the methods. As shown in the following code, I used the Virastyar library to refine characters and tokenize strings.

public static VerbBasedSentence MakeVerbBasedSentence(string sentence)
{
    sentence = StringUtil.RefineAndFilterPersianWord(sentence); // using the refiner of Virastyar software
    var tokenized = PersianWordTokenizer.Tokenize(sentence,true); // using the tokenizer of Virastyar software
    return MakeVerbBasedSentence(tokenized.ToArray());
}

To learn more about its options, see the official Virastyar site: http://virastyar.ir.

If you do not want to use it, you can remove the lines mentioned above from the code.

You can find a morphology-based POS tagger that can be used in your code. You can also use the tagger to help improve learning-based POS taggers such as an HMM tagger.

I assumed that writers use the semi-space (zero-width non-joiner, U+200C) in verb inflections. In the Bijankhan corpus, you can replace spaces with semi-spaces in words tagged as verbs.
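
The replacement described above could be sketched like this before passing the arrays to the tagged overload. `SemiSpaceHelper` is a hypothetical helper that is not part of the package, and the verb tag string is a parameter because the exact tag name in your copy of the corpus is an assumption here.

```csharp
using System;

// Hypothetical helper; not part of the package.
public static class SemiSpaceHelper
{
    // The Persian semi-space is U+200C ZERO WIDTH NON-JOINER.
    public const char SemiSpace = '\u200C';

    // Returns a copy of words in which every space inside a verb-tagged
    // word is replaced by a semi-space. Other words are left unchanged.
    public static string[] JoinVerbParts(string[] words, string[] posTags, string verbTag)
    {
        if (words.Length != posTags.Length)
            throw new ArgumentException("words and posTags must have the same length");

        var result = new string[words.Length];
        for (int i = 0; i < words.Length; i++)
        {
            result[i] = posTags[i] == verbTag
                ? words[i].Replace(' ', SemiSpace)
                : words[i];
        }
        return result;
    }

    public static void Main()
    {
        // "V" as the verb tag and "ADV" as the other tag are assumptions for this demo.
        string[] words = { "می گویم", "به راحتی" };
        string[] tags = { "V", "ADV" };

        string[] fixedWords = JoinVerbParts(words, tags, "V");
        Console.WriteLine(fixedWords[0].IndexOf(SemiSpace) >= 0); // verb now contains U+200C
        Console.WriteLine(fixedWords[1] == words[1]);             // non-verb is untouched
    }
}
```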
