forked from jackfeichen/Naive-Bayes-Classifier
sonuk/Naive-Bayes-Classifier
This is an implementation of a Naive Bayes classifier for news classification.

Attributes

* There are 20 different topic areas
* 2000 documents, 100 from each news category
* Each filename includes the category and an index
* There are 10 splits of the data

The implementation of the Naive Bayes classifier was taken through several stages:

1. Default implementation: identify word occurrence frequencies, then associate the training label with each document and the calculated likelihood.
2. Feature selection: remove frequently occurring words from the vocabulary to reduce noise.
3. Smoothing: provide a non-zero count for infrequent words, which may have a strong influence on the classification of a document but would otherwise be ignored because of their infrequency.

Accuracy changes from stage 1 to stage 3 were dramatic, in some cases jumping by over 10 percentage points.

The default accuracy for stage 1:

| split | accuracy |
|---|---|
| split 1 | 0.62105 |
| split 2 | 0.71809 |
| split 3 | 0.67692 |
| split 4 | 0.65104 |
| split 5 | 0.60309 |
| split 6 | 0.70588 |
| split 7 | 0.67016 |
| split 8 | 0.70213 |
| split 9 | 0.65263 |
| split 10 | 0.64249 |
| mean | 0.66435 |
| std | 0.03557 |

The updated accuracy for stage 2, where N is the number of words removed from the vocabulary:

| split | N=30 | N=50 | N=100 | N=200 | N=1000 |
|---|---|---|---|---|---|
| split 1 | 0.71053 | 0.72105 | 0.74211 | 0.76316 | 0.79421 |
| split 2 | 0.81915 | 0.82447 | 0.81915 | 0.81383 | 0.76596 |
| split 3 | 0.75897 | 0.75897 | 0.76923 | 0.77436 | 0.72821 |
| split 4 | 0.74479 | 0.75521 | 0.77604 | 0.76563 | 0.74479 |
| split 5 | 0.74742 | 0.77320 | 0.77320 | 0.77320 | 0.75258 |
| split 6 | 0.80214 | 0.80749 | 0.81818 | 0.82888 | 0.75401 |
| split 7 | 0.77487 | 0.78534 | 0.81152 | 0.83246 | 0.80628 |
| split 8 | 0.79787 | 0.81383 | 0.81915 | 0.80319 | 0.79787 |
| split 9 | 0.74211 | 0.75263 | 0.74211 | 0.75789 | 0.74737 |
| split 10 | 0.78756 | 0.80311 | 0.79793 | 0.79793 | 0.74093 |

Finally, stage 3 was applied directly to stage 1 (without feature selection). The results below are for split 2 only:

| m | accuracy |
|---|---|
| 10 | 0.75532 |
| 100 | 0.77128 |
| 1000 | 0.79255 |
| 5000 | 0.80319 |
| 10000 | 0.80319 |
| 15000 | 0.80319 |
| 20000 | 0.81915 |
| 25000 | 0.81915 |
| 30000 | 0.82447 |
| 35000 | 0.82247 |
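The three stages above can be illustrated with a small Python sketch. This is not the repository's C# code. It is a textbook multinomial Naive Bayes written under stated assumptions: stage 2 is read as dropping the `drop_top_n` most frequent words overall, and stage 3 is read as an m-estimate with a uniform prior of 1/|V| over the vocabulary. The README does not define `m`, so that reading, along with the names `train`, `classify`, and `drop_top_n`, is hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, drop_top_n=0):
    """Build a model from docs, a list of (label, list_of_tokens) pairs."""
    totals = Counter(tok for _, toks in docs for tok in toks)
    # Stage 2 (assumed reading): drop the N most frequent words overall.
    dropped = {w for w, _ in totals.most_common(drop_top_n)}
    vocab = set(totals) - dropped

    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training documents
    for label, toks in docs:
        doc_counts[label] += 1
        word_counts[label].update(t for t in toks if t in vocab)
    return {"vocab": vocab, "word_counts": word_counts,
            "doc_counts": doc_counts, "n_docs": len(docs)}


def classify(model, tokens, m=0.0):
    """Return the label with the highest log posterior.

    m is the weight of an m-estimate with a uniform prior 1/|V|
    (an assumption); m=0 disables smoothing.
    """
    vocab = model["vocab"]
    best_label, best_score = None, -math.inf
    for label, n_label in model["doc_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + m
        score = math.log(n_label / model["n_docs"])  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # words outside the vocabulary carry no evidence
            p = (counts[tok] + m / len(vocab)) / denom if denom > 0 else 0.0
            if p == 0.0:
                # Without smoothing, one unseen word rules the class out.
                score = -math.inf
                break
            score += math.log(p)
        if best_label is None or score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the news corpus.
    docs = [
        ("sports", "the team won the game".split()),
        ("sports", "the player scored in the game".split()),
        ("space", "the rocket reached the orbit".split()),
        ("space", "the shuttle left the orbit".split()),
    ]
    model = train(docs, drop_top_n=1)
    print(classify(model, "the game was won".split(), m=1.0))
```

In this sketch a larger `m` pulls every word probability toward the uniform prior, which is one plausible reason accuracy in the table above keeps changing as `m` grows into the tens of thousands.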
About
A C# (Mono-Friendly) implementation of a Naive Bayes classifier trained to classify documents into 20 different subject areas