Friday, July 11, 2014

Naive Bayes classifier using Mahout



Thomas Bayes was an 18th-century Presbyterian minister. His famous result appeared in "An Essay towards solving a Problem in the Doctrine of Chances", published posthumously in 1763, and its value was only fully appreciated by the scientific community long after his death. The essay is a study of conditional probability.

The naive Bayes classifier is used for supervised learning in data mining.
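The underlying idea is Bayes' theorem for conditional probability. Here is a toy sketch with made-up numbers, using a single hypothetical "spam" indicator word:

```python
# Made-up priors and likelihoods for a one-word spam filter:
# P(spam) = 0.2, P("free" | spam) = 0.6, P("free" | ham) = 0.05.
p_spam = 0.2
p_ham = 1 - p_spam
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word),
# where P(word) is expanded by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(p_spam_given_word)  # 0.12 / 0.16 = 0.75
```

Seeing the word shifts the spam probability from 20% to 75%; the "naive" classifier multiplies such per-word likelihoods together, assuming the words are independent given the class.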

To train the Naive Bayes classifier and classify a dataset with it, you first need to convert the dataset into Hadoop sequence files. Hadoop takes the input files and generates the output as one or more chunked sequence files. The Mahout command below generates these files:

$./mahout seqdirectory -i ${WORK_DIR}/input_files -o ${WORK_DIR}/new_sequencefiles

This command takes every file in the directory ${WORK_DIR}/input_files as input and transforms them into a sequence file.
Check Mahout's help to find out more about the command:
$./mahout seqdirectory --help


The following Hadoop command examines the content of the sequence file:

$hadoop fs -text  ${WORK_DIR}/new_sequencefiles | more

Many machine-learning and data-mining algorithms do not operate on raw data directly; they require feature vectors, which must be generated first.

The naive Bayes algorithm does not work with words and raw text directly; it works with weighted vectors associated with the documents. To transform the raw text into weighted vectors, Mahout provides a convenient command:

$./mahout seq2sparse -i ${WORK_DIR}/new_sequencefiles -o ${WORK_DIR}/new_vectorfiles \
-lnorm -nv -wt tfidf

Refer to the help output ($./mahout seq2sparse --help) for the exact meaning of -lnorm, -nv, and -wt tfidf (vector normalization, named vectors, and TF-IDF weighting, respectively).
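To get a feel for what -wt tfidf produces, here is a minimal TF-IDF sketch in Python over a made-up three-document corpus; Mahout's exact weighting and smoothing may differ:

```python
import math

# Tiny made-up corpus of tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tfidf(term, doc, docs):
    # term frequency: how often the term occurs in this document
    tf = doc.count(term) / len(doc)
    # document frequency: in how many documents the term occurs at all
    df = sum(1 for d in docs if term in d)
    # inverse document frequency: rare terms get a higher weight
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in every document, so idf = log(1) = 0 and its weight vanishes.
print(tfidf("the", docs[0], docs))  # 0.0
print(tfidf("cat", docs[0], docs))  # positive: "cat" is in only 2 of 3 docs
```

This is why TF-IDF vectors down-weight ubiquitous words while emphasizing terms that discriminate between documents.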

We now need to train the algorithm. A common approach is an 80/20 split: 80% of the vectors are used to train the algorithm and 20% are held out for testing. Note that --randomSelectionPct 40 in the command below actually reserves 40% for testing; adjust it to 20 for an 80/20 split.
Mahout provides a command-line tool to split the weighted vectors:

$./mahout split \
-i ${WORK_DIR}/new_vectorfiles/tfidf-vectors \
--trainingOutput ${WORK_DIR}/new-train-vectors \
--testOutput ${WORK_DIR}/new-test-vectors \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
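Conceptually, the split routes each record at random. A rough Python sketch of what --randomSelectionPct 40 does (the per-record routing is an assumption about the behaviour, not Mahout's actual code):

```python
import random

# Each record is independently sent to the test set with the given
# probability; everything else goes to the training set.
random.seed(42)

records = [f"doc{i}" for i in range(100)]
test_pct = 40

train, test = [], []
for r in records:
    (test if random.random() * 100 < test_pct else train).append(r)

print(len(train), len(test))  # counts vary with the seed, roughly 60/40
```

The important property is that the two sets are disjoint, so the test accuracy is measured on vectors the model has never seen.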

After this command has run, we will have two directories, one containing the training vectors and one containing the test vectors.
We first train the naive Bayes algorithm with the vectors from ${WORK_DIR}/new-train-vectors as follows:




(As an aside: Apache Sqoop is a tool for importing data from an RDBMS into HDFS, where it can then be prepared for analysis with Mahout.)


$./mahout trainnb \
-i ${WORK_DIR}/new-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow

This generates a model in the form of a binary file. It represents the weight matrix together with the feature and label sums.
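To see roughly what such a model contains, here is a minimal Python sketch of multinomial naive Bayes training with Laplace smoothing over a made-up two-label corpus; Mahout's internal representation differs in detail:

```python
import math
from collections import Counter, defaultdict

# Made-up labeled training documents.
train = [
    ("sports", "ball goal team"),
    ("sports", "goal match"),
    ("tech", "code bug compiler"),
]

# label -> term -> count: this table of counts is the "weight matrix".
counts = defaultdict(Counter)
for label, text in train:
    counts[label].update(text.split())

vocab = {t for c in counts.values() for t in c}

def log_weight(label, term, alpha=1.0):
    # Smoothed log-likelihood log P(term | label), built from the per-label
    # term counts and the label's total count (the "feature and label sums").
    c = counts[label]
    return math.log((c[term] + alpha) / (sum(c.values()) + alpha * len(vocab)))

print(log_weight("sports", "goal"))  # log((2+1)/(5+7)) = log(0.25)
```

Classification then just sums these log-weights over a document's terms for each label and picks the largest.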

We then test the trained model against the held-out test vectors:
$./mahout testnb \
-i ${WORK_DIR}/new-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/new-testing

After this command completes, you will see the accuracy of the trained classifier. As a rough rule of thumb, an accuracy of at least 80% is a reasonable result, though what counts as good enough depends on the application.
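Conceptually, the test step classifies each held-out vector and compares the prediction with the true label. A toy accuracy computation (the labels below are made up):

```python
# True labels of held-out documents versus the classifier's predictions.
true_labels = ["spam", "spam", "ham", "ham", "ham"]
predicted   = ["spam", "ham",  "ham", "ham", "ham"]

# Accuracy = number of correct predictions / total predictions.
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)
print(f"{accuracy:.0%}")  # 80%
```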

Use the following command to dump the generated vectors to a text file:
$./mahout vectordump -i ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000 \
-o ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000dump

Read more at:
https://mahout.apache.org/users/classification/twenty-newsgroups.html










Thursday, July 10, 2014

Bitbucket: how to avoid typing your password every time in git?



With HTTPS remotes, Git will always ask for your password; the closest you can get is caching it so you do not have to type it every time:

$git config --global credential.helper "cache --timeout=3600"

Alternatively, switch the remote to SSH and set up SSH keys, which avoids password prompts entirely.

Wednesday, July 9, 2014

Information Retrieval Tasks

Here is a list of typical Information Retrieval Tasks:



Information filtering 

Remove redundant or undesired information from an information stream, semi- or fully automatically, before presenting it to human users. 

Document summarization
Create a shortened version of a text in order to reduce the information overload.

Document clustering and categorization 
Group documents together based on their proximity (as defined by a suitable spatial model) in an unsupervised fashion.
Clustering is an unsupervised technique that does not assume a priori knowledge: data are grouped into categories on the basis of some measure of inherent similarity between instances, in such a way that objects in one cluster are very similar (compactness property) and objects in different clusters are different (separateness property).
Classification is a supervised technique that assigns a class to each data item: an initial training phase is performed over a set of human-annotated data, and the resulting model is then applied to the remaining elements. Techniques such as naive Bayes, regression, decision trees, and support vector machines are used for classification problems. Supervised learning thus requires a priori knowledge of the classes in the form of labeled examples. Unsupervised learning, in contrast, requires no such pre-knowledge: intrinsic similarity in the data is used to group instances, as in clustering. K-means is an example of cluster analysis. Once the clusters are determined, objects are assigned to them and tagged accordingly; this is known as labeling, and it is performed automatically, with no human involved.
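As an illustration of clustering by proximity alone, here is a minimal one-dimensional k-means sketch on made-up data; real implementations handle multi-dimensional vectors and smarter initialization:

```python
import random

# Minimal 1-D k-means: points are grouped purely by distance to the
# nearest center, with no labels involved (unsupervised).
def kmeans_1d(points, k, iters=20):
    centers = random.sample(points, k)  # naive random initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

random.seed(0)
data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans_1d(data, 2)
print(centers)  # two centers, near 1.0 and 10.0
```

The compactness and separateness properties show up directly: each final center sits at the mean of a tight group, and the two groups are far apart.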

Question answering (QA) 
Select relevant document portions to answer user’s queries formulated in natural language.



Recommender systems
A form of information filtering, by which interesting information items (e.g., songs, movies, or books) are presented to users based on their profile or their neighbors’ taste, neighborhood being defined by such aspects as geographical proximity, social acquaintance, or common interests.
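A minimal sketch of the neighborhood idea, using a deliberately crude similarity measure (count of co-rated items) over made-up ratings:

```python
# Made-up user -> item -> rating data.
ratings = {
    "alice": {"song_a": 5, "song_b": 4},
    "bob":   {"song_a": 5, "song_b": 5, "song_c": 4},
    "carol": {"song_d": 2},
}

def similarity(u, v):
    # Crude neighborhood measure: how many items both users have rated.
    shared = ratings[u].keys() & ratings[v].keys()
    return len(shared)

def recommend(user):
    # Pick the most similar neighbor and suggest their items
    # that the user has not rated yet.
    others = [u for u in ratings if u != user]
    neighbor = max(others, key=lambda u: similarity(user, u))
    return sorted(ratings[neighbor].keys() - ratings[user].keys())

print(recommend("alice"))  # ['song_c'] via neighbor bob
```

Real recommenders replace the similarity with rating-aware measures (e.g. Pearson correlation or cosine similarity) and aggregate over many neighbors, but the structure is the same: define a neighborhood, then transfer the neighbors' preferences.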