Skip to content

A C# (Mono-Friendly) implementation of a Naive Bayes classifier trained to classify documents into 20 different subject areas

Notifications You must be signed in to change notification settings

sonuk/Naive-Bayes-Classifier

 
 

Repository files navigation

This is an implementation of a Naive Bayes Classifier for News classification.

Attributes
* There are 20 different topic areas
* 2000 documents, 100 from each News category
* filename include Category and index
* there are 10 splits on the data

An implementation of the Naive Bayes classifier was taken through several different stages:

1. Default implementation: identify word occurence frequency and then associate the training label with each document and the calculated likelihood.

2. Feature selection was implemented to extract out frequent occuring words within the vocabulary to remove noise.

3. Smoothing was applied to provide a non-zero count for infrequent words that may have a strong influence on the classification of the document, but is ignored due to its infrequency.

Accuracy changes from 1-3 were dramatic, in some cases jumping up over 10 percentage points:

The default accuracy for 1:
split	accuracy
split 1	0.62105
split 2	0.71809
split 3	0.67692
split 4	0.65104
split 5	0.60309
split 6	0.70588
split 7	0.67016
split 8	0.70213
split 9	0.65263
split 10	0.64249
mean	0.66435
std	0.03557

The updated accuracy for 2, for different count N of words that were removed from the vocabulary:
		N=30	n=50	N=100	N=200	N=1000
split1	0.71053	0.72105	0.74211	0.76316	0.79421
split2	0.81915	0.82447	0.81915	0.81383	0.76596
split3	0.75897	0.75897	0.76923	0.77436	0.72821
split4	0.74479	0.75521	0.77604	0.76563	0.74479
split5	0.74742	0.77320	0.77320	0.77320	0.75258
split6	0.80214	0.80749	0.81818	0.82888	0.75401
split7	0.77487	0.78534	0.81152	0.83246	0.80628
split8	0.79787	0.81383	0.81915	0.80319	0.79787
split9	0.74211	0.75263	0.74211	0.75789	0.74737
split10	0.78756	0.80311	0.79793	0.79793	0.74093

Finally, 3 was applied to 1 directly (not including feature selction), the result below shows only Split 2:
m		accuracy
10		0.75532
100		0.77128
1000	0.79255
5000	0.80319
10000	0.80319
15000	0.80319
20000	0.81915
25000	0.81915
30000	0.82447
35000	0.82247

About

A C# (Mono-Friendly) implementation of a Naive Bayes classifier trained to classify documents into 20 different subject areas

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published