This repository has been archived by the owner on May 24, 2019. It is now read-only.

yassersouri/Corpus-Builder


Corpus Format

CoNLL. You can find more info here.

# Projects

## Corpus Builder

#### References

This project uses SCICT.PersianTools, which is built by SCICT and is open source under the GPL. You can find the source code here.

#### What it does

The `Program.cs` file calls `RefineAllFiles` on every file inside the specified directory. It then separates all sentences and words and writes the results as new files in new directories. To overwrite the current files instead, pass the same directory for both arguments: `SeparateAllSentencesAndWords(dir,dir);`.

#### Example Output

Check here.

## Tagger

#### References

This project uses many SCICT.NLP DLLs, which are built by SCICT and are open source under the GPL. You can find the source code here. It also uses YAXLib. `Tagger.cs` and `Token.cs` were written by Mohammad Hedayati.

#### What it does

The output of the Corpus Builder project has one word per line, with an empty line after each sentence. The tagger tags each word and extracts its lemma, POS tag, person, and number. This information is then placed on the same line as the word, separated by tabs.

#### Example Output

Check here.
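
For readers who want to consume the Tagger output from another tool, the following is a minimal Python sketch of a reader for the layout described above. It is not part of this repository. It assumes the column order is word, lemma, POS tag, person, number (the README lists the fields but does not confirm their order), and the names `TaggedWord` and `read_sentences` are hypothetical. The sample values are invented placeholders, not real tagger output.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TaggedWord(NamedTuple):
    # Assumed column order; adjust if the real files differ.
    word: str
    lemma: str
    pos: str
    person: str
    number: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[TaggedWord]]:
    """Yield one list of TaggedWord per sentence.

    Each non-empty line is one tab-separated word record;
    an empty line marks the end of a sentence.
    """
    sentence: List[TaggedWord] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        # Pad short lines with "_" and ignore extra columns.
        fields = (fields + ["_"] * 5)[:5]
        sentence.append(TaggedWord(*fields))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


if __name__ == "__main__":
    # Placeholder data, not real tagger output.
    sample = "w1\tl1\tN\t3\tSING\nw2\tl2\tV\t3\tSING\n\nw3\tl3\tN\t_\tPLUR\n"
    for sent in read_sentences(sample.splitlines()):
        print([(t.word, t.pos) for t in sent])
```

To read a real output file, pass an open file object instead of the sample lines, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` is whichever file the Tagger produced.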

About

Builds a Persian corpus from text files generated in the Crawler project. Format: CoNLL
