Friday, July 11, 2014

Naive Bayes classifier using Mahout



Thomas Bayes was an English Presbyterian minister whose "An Essay towards solving a Problem in the Doctrine of Chances" was published posthumously in 1763. The scientific community only came to appreciate the full value of this work long after his death. It is a study of conditional probability, and the theorem that bears his name is the basis of the Naive Bayes classifier.

This classifier is used for supervised learning in data mining: it is trained on labelled examples and then assigns labels to new data.
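
For reference, the rule behind the classifier can be written down in its textbook form. This is the standard formulation, not something taken from Mahout's source code; the symbols c (a class label), d (a document) and w_i (the i-th word of d) are notation introduced here for illustration.

```latex
P(c \mid d) = \frac{P(c)\, P(d \mid c)}{P(d)},
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The "naive" part is the assumption that the words are conditionally independent given the class, which is what allows P(d | c) to be replaced by the product on the right.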

To train a Naive Bayes classifier on a dataset with Mahout, you first need to convert the dataset to Hadoop sequence files. Mahout takes the files in the input directory and generates a chunk file from them. The command below shows how to use Mahout to generate these files:

$./mahout seqdirectory -i ${WORK_DIR}/input_files -o ${WORK_DIR}/new_sequencefiles

This command takes every file in the directory ${WORK_DIR}/input_files as input and transforms them into a sequence file.
Check Mahout's help output to find out more about the command:
$./mahout seqdirectory --help
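
Before running seqdirectory it helps to know what the input directory might look like. The sketch below builds a tiny toy layout with one sub-directory per category, which is the convention used by the twenty newsgroups example linked at the end of this post. The category names, file names and the /tmp/mahout-work default are made up for illustration.

```shell
#!/bin/sh
# Hypothetical toy layout: one sub-directory per category, one text file per document.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"

mkdir -p "${WORK_DIR}/input_files/sports" "${WORK_DIR}/input_files/politics"

# Two made-up documents, one for each category.
printf 'The team won the match in extra time.\n' > "${WORK_DIR}/input_files/sports/doc1.txt"
printf 'Parliament voted on the new budget.\n' > "${WORK_DIR}/input_files/politics/doc1.txt"

# List the files that seqdirectory would read.
find "${WORK_DIR}/input_files" -type f | sort
```

As far as I understand, the sub-directory name ends up in the key of each sequence-file record, and that is where the labels come from later during training.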


The following Hadoop command lets you examine the contents of the sequence file:

$hadoop fs -text  ${WORK_DIR}/new_sequencefiles | more

A number of machine-learning and data-mining algorithms operate on vectors, so the data must be converted to vectors before it is passed to them.

The Naive Bayes algorithm does not work with words and raw data, but with weighted vectors associated with the documents. To transform the raw data or text into weighted vectors, Mahout provides a convenient command:
./mahout seq2sparse -i ${WORK_DIR}/new_sequencefiles -o ${WORK_DIR}/new_vectorfiles \
-lnorm -nv -wt tfidf

Refer to the help output (./mahout seq2sparse --help) for the exact meaning of the options: -lnorm (log normalization of the output vectors), -nv (named vectors) and -wt tfidf (TF-IDF weighting).
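
As a rough guide to what TF-IDF weighting means, here is the textbook definition. Mahout's exact formula may add smoothing or damping on top of this, so treat it as the general idea only. The symbols are notation introduced here: tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

For example, with the natural logarithm, a term that occurs 3 times in a document and appears in 10 out of 1000 documents gets a weight of 3 · ln(100) ≈ 13.8, while a term that appears in every document gets a weight of 0.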

We now need to train the algorithm. A common approach is an 80/20 split: 80% of the vectors are used to train the algorithm and the remaining 20% are held back for testing.
Mahout provides a command-line way to split the weighted vectors:

$./mahout split \
-i ${WORK_DIR}/new_vectorfiles/tfidf-vectors \
--trainingOutput ${WORK_DIR}/new-train-vectors \
--testOutput ${WORK_DIR}/new-test-vectors \
--randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

After this command has run, we will have two directories, one for the training vectors and one for the test vectors.
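
The idea of a random hold-out split can be illustrated without Mahout. The sketch below is a conceptual illustration only, not how Mahout implements the split: it sends each of 100 dummy records to a test file with probability 0.20 and to a training file otherwise. The file names and the seed are arbitrary choices.

```shell
#!/bin/sh
# Conceptual sketch only: randomly hold out roughly 20% of 100 dummy records.
TMP_DIR="$(mktemp -d)"
seq 1 100 > "${TMP_DIR}/all.txt"

# Make sure both output files exist even if one side receives no records.
: > "${TMP_DIR}/train.txt"
: > "${TMP_DIR}/test.txt"

# Each record goes to the test set with probability 0.20, otherwise to the training set.
awk -v out="${TMP_DIR}" 'BEGIN { srand(42) }
    { if (rand() < 0.20) print > (out "/test.txt"); else print > (out "/train.txt") }' \
    "${TMP_DIR}/all.txt"

# Show how many records landed in each set.
wc -l "${TMP_DIR}/train.txt" "${TMP_DIR}/test.txt"
```

Because the selection is random, the test set holds about 20 records rather than exactly 20, but every record ends up in exactly one of the two sets.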

As a side note, Apache Sqoop can be used to import data from a relational database (RDBMS) into HDFS to prepare it for analysis with Mahout.

We first train the Naive Bayes algorithm with the vectors from ${WORK_DIR}/new-train-vectors as follows:


./mahout trainnb \
-i ${WORK_DIR}/new-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow

This will generate a model in the form of a binary file. It contains the weight matrix and the feature and label sums.

We test the trained model against the held-out 20% of the initial input vectors:
./mahout testnb \
-i ${WORK_DIR}/new-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/new-testing

After this command you will see a summary of how accurately the model classified the test vectors. As a rule of thumb, an accuracy of at least 80% should be enough.
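
To make the accuracy figure concrete: it is the number of correctly classified test documents divided by the total number of test documents. The sketch below computes it for four made-up (actual, predicted) label pairs. The data and variable names are hypothetical and this is not the output format of testnb.

```shell
#!/bin/sh
# Hypothetical (actual, predicted) label pairs, one test document per line.
RESULTS='sports sports
politics politics
sports politics
politics politics'

# Count the lines where both labels agree and divide by the number of lines.
ACCURACY=$(printf '%s\n' "$RESULTS" | awk '$1 == $2 { correct++ } END { printf "%.2f", correct / NR }')
echo "accuracy: ${ACCURACY}"
```

Here 3 of the 4 documents are classified correctly, so the accuracy is 0.75, which would fall short of the 80% rule of thumb.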

Use the following command to dump the vectors to a text file:
./mahout vectordump -i ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000 \
-o ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000dump

Read more at:
https://mahout.apache.org/users/classification/twenty-newsgroups.html









