Wednesday, December 31, 2014

vagrant/machine.rb:153:in `action': wrong number of arguments (2 for 1) (ArgumentError)


Change the line from:

                  def action(name, **opts)

to:

                  def action(name, opts)

in the file /opt/vagrant/embedded/gems/gems/vagrant-1.7.1/lib/vagrant/machine.rb at line 153.
The error appears to occur because with `**opts` Ruby treats the second argument as keyword arguments, so a caller passing a plain options hash raises the ArgumentError; making it a positional parameter accepts the hash. This is fixed in Vagrant versions after 1.7.1.

Tuesday, December 30, 2014

Vagrant: Failed to mount folders in Linux guest. This is usually because the "vboxsf" file system is not available.



This is an issue with older versions of VirtualBox. Download the latest VirtualBox from:
VirtualBox download site

If the problem persists after upgrading to the latest version, you may also need to install the vbguest plugin:
$ vagrant plugin install vagrant-vbguest
More tips for solving this problem:
https://github.com/mitchellh/vagrant/issues/3341
https://www.virtualbox.org/manual/ch04.html#idp54932560

Tuesday, October 14, 2014

Caused by: org.apache.axis2.AxisFault: Address information does not exist in the Endpoint Reference (EPR).The system cannot infer the transport mechanism.



You may get this error when, as in my case, the WSDL filename is configured as a resource:

<property name="wsdlFilename" value="MyServiceWS20Service.wsdl" />


Axis2 cannot find the WSDL on the classpath, either because it is missing from the packaged resources or because the filename is misspelled.

Friday, August 1, 2014

Get http:///var/run/docker.sock/v1.12/info: dial unix /var/run/docker.sock: no such file or directory


If you are, like me, running on Mac OS X, you get this error because you have not started the Docker host VM and exported DOCKER_HOST.
$ docker info
2014/08/01 17:34:31 Get http:///var/run/docker.sock/v1.12/info: dial unix /var/run/docker.sock: no such file or directory


To fix it, start the boot2docker VM:
$ boot2docker up

You will see output like the following:
2014/08/01 17:34:21 Started.
2014/08/01 17:34:21 To connect the Docker client to the Docker daemon, please set:

2014/08/01 17:34:21     export DOCKER_HOST=tcp://134.155.10.100:2375

Export DOCKER_HOST in your shell:
$ export DOCKER_HOST=tcp://134.155.10.100:2375

Check that the error is gone and Docker is working properly:


$ docker info
Containers: 0
Images: 0
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Dirs: 0
Execution Driver: native-0.2
Kernel Version: 3.15.3-tinycore64
Debug mode (server): true
Debug mode (client): false
Fds: 10
Goroutines: 10
EventsListeners: 0
Init Path: /usr/local/bin/docker


More good tips about Docker:
https://gist.github.com/wsargent/7049221


Friday, July 11, 2014

Naive Bayes classifier using Mahout



Thomas Bayes was an English Presbyterian minister. His essay on probability, "An Essay towards solving a Problem in the Doctrine of Chances", was published posthumously in 1763, and its full value was only appreciated by the scientific community much later. It laid the groundwork for the study of conditional probability.

The naive Bayes classifier is a supervised learning technique used in data mining.
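As a concept sketch of how the classifier works (plain Python with made-up training data; this is not the Mahout implementation): it picks the class c that maximizes P(c) times the product of P(w|c) over the words w of a document.

```python
# Minimal multinomial naive Bayes on word counts -- a concept sketch only,
# not the Mahout implementation. Training data and labels are made up.
from collections import Counter
from math import log

train = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches now"),
    ("ham",  "meeting schedule for monday"),
    ("ham",  "monday project meeting notes"),
]

# Per-class word counts and per-class document counts
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Pick the class maximizing log P(c) + sum of log P(w|c), with add-one smoothing."""
    best_label, best_score = None, float("-inf")
    for label in word_counts:
        score = log(doc_counts[label] / sum(doc_counts.values()))  # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace (add-one) smoothed likelihood
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("buy cheap pills"))  # -> spam
print(classify("monday meeting"))   # -> ham
```

Mahout's trainnb/testnb do the same kind of computation, but distributed over Hadoop and on TF-IDF weighted vectors rather than raw counts.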

In order to train the naive Bayes classifier on a dataset, you first need to use Hadoop to convert the dataset to Hadoop sequence files. Mahout reads the raw input files and generates the sequence files in chunks. The command below illustrates the use of Mahout to generate these files:

$./mahout seqdirectory -i ${WORK_DIR}/input_files -o ${WORK_DIR}/new_sequencefiles

This command takes every file in the directory ${WORK_DIR}/input_files and transforms them into a sequence file.
Check the help output of mahout to find out more about the command:
$./mahout seqdirectory --help


The following Hadoop command lets you examine the contents of the sequence file:

$hadoop fs -text  ${WORK_DIR}/new_sequencefiles | more

Many machine-learning and data-mining algorithms operate on vectors rather than on raw text, so the documents must first be vectorized.

The naive Bayes algorithm does not work with words and raw data; it works with weighted vectors associated with the documents. To transform the raw text into weighted vectors, mahout provides a convenient command:
./mahout seq2sparse -i ${WORK_DIR}/new_sequencefiles -o ${WORK_DIR}/new_vectorfiles
-lnorm -nv -wt tfidf

Refer to the help output for details on -lnorm (L_2 normalization), -nv (named vectors), and -wt tfidf (TF-IDF weighting).
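To illustrate what the TF-IDF weighting does, here is a plain-Python sketch with a made-up corpus (Mahout's exact formula differs in details): terms that appear in fewer documents get a higher weight.

```python
# A sketch of the TF-IDF idea behind `seq2sparse -wt tfidf`; the corpus is made up.
from math import log

docs = [
    "hadoop stores sequence files",
    "mahout reads sequence files",
    "mahout trains a classifier",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency of each term
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tfidf(term, toks):
    tf = toks.count(term)       # term frequency in this document
    idf = log(N / df[term])     # inverse document frequency
    return tf * idf

# "mahout" appears in 2 of 3 docs, "classifier" in only 1 of 3:
print(tfidf("mahout", tokenized[2]))      # lower weight
print(tfidf("classifier", tokenized[2]))  # higher weight
```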

We now need to train the algorithm. The standard approach is to hold out part of the vectors for testing and train on the rest (an 80/20 split is common; the command below holds out 40% via --randomSelectionPct 40).
mahout provides a command-line way to split the weighted vectors:

$./mahout split
-i ${WORK_DIR}/new_vectorfiles/tfidf-vectors
--trainingOutput ${WORK_DIR}/new-train-vectors
--testOutput ${WORK_DIR}/new-test-vectors
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
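The effect of --randomSelectionPct 40 can be sketched in plain Python (the document names here are stand-ins for the tf-idf vectors):

```python
# A sketch of the random hold-out split that `mahout split --randomSelectionPct 40`
# performs on the vector files; the items are made-up stand-ins.
import random

vectors = [f"doc_{i}" for i in range(100)]  # stand-ins for tf-idf vectors

random.seed(42)                 # fixed seed for a reproducible shuffle
shuffled = vectors[:]
random.shuffle(shuffled)

cut = int(len(shuffled) * 0.40)  # 40% held out for testing
test_set = shuffled[:cut]
train_set = shuffled[cut:]

print(len(train_set), len(test_set))  # -> 60 40
```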

After this command has run we will have two directories, one with the training vectors and one with the test vectors.
We first train the naive Bayes algorithm with the vectors from ${WORK_DIR}/new-train-vectors as follows:

./mahout trainnb
-i ${WORK_DIR}/new-train-vectors -el
-o ${WORK_DIR}/model
-li ${WORK_DIR}/labelindex
-ow

This will generate a model in the form of a binary file, representing the weight matrix and the feature and label sums.

(Aside: Sqoop is an Apache tool used to acquire data from an RDBMS and import it into HDFS, to prepare it for analysis with Mahout.)

We then test the algorithm against the held-out test vectors:
./mahout testnb
-i ${WORK_DIR}/new-test-vectors
-m ${WORK_DIR}/model
-l ${WORK_DIR}/labelindex
-ow -o ${WORK_DIR}/new-testing

After this command you will see the accuracy of the trained classifier. An accuracy of at least 80% should generally be sufficient.
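The reported accuracy is simply the fraction of correctly classified test documents; with made-up confusion-matrix counts:

```python
# How an accuracy figure like the one `mahout testnb` reports is computed
# from a confusion matrix -- a toy sketch with made-up numbers.
confusion = {
    # (actual, predicted): count
    ("spam", "spam"): 45,
    ("spam", "ham"):  5,
    ("ham", "ham"):   40,
    ("ham", "spam"):  10,
}

correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
total = sum(confusion.values())
accuracy = correct / total
print(f"accuracy = {accuracy:.0%}")  # -> accuracy = 85%
```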

Use the following command to dump the vectors to a text file:
mahout vectordump -i ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000 -o ${WORK_DIR}/new_vectorfiles/tfidf-vectors/part-r-00000dump

Check out more at:
https://mahout.apache.org/users/classification/twenty-newsgroups.html


Thursday, July 10, 2014

Bitbucket: how to avoid typing your password every time with git and Bitbucket?



Over HTTPS this cannot be avoided entirely; the best you can do is cache the credentials so you do not have to type them every time (alternatively, clone over SSH with a key pair):

$git config --global credential.helper "cache --timeout=3600"

Wednesday, July 9, 2014

Information Retrieval Tasks

Here is a list of typical Information Retrieval Tasks:



Information filtering 

Remove redundant or undesired information from an information stream, using semi- or fully-automatic methods, before presenting it to human users.

Document summarization
Create a shortened version of a text in order to reduce the information overload.

Document clustering and categorization 
Group documents together based on their proximity (as defined by a suitable spatial model) in an unsupervised fashion.
Clustering is an unsupervised technique that does not assume a priori knowledge: data are grouped into categories on the basis of some measure of inherent similarity between instances, such that objects in one cluster are very similar (compactness property) and objects in different clusters are different (separateness property). K-means is a common example of cluster analysis.
Classification is a supervised technique that assigns a class to each data item by performing an initial training phase over a set of human-annotated data and then applying the learned classifier to the remaining elements. Supervised learning models therefore require a priori knowledge of the classes, whereas unsupervised models do not: they rely on intrinsic similarities in the data to group it. Techniques such as naive Bayes, regression, decision trees, and support vector machines are used for classification problems. Once the clusters are determined, objects can be tagged with their cluster; this is known as labeling, and it is performed automatically, with no human involved.
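As a toy illustration of the k-means clustering mentioned above (1-D data with k=2; the values and initial centroids are made up):

```python
# A toy 1-D k-means loop to illustrate unsupervised clustering (k=2).
data = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
centroids = [1.0, 9.0]  # arbitrary initial centroids

for _ in range(10):  # repeat assign/update steps until stable
    clusters = {0: [], 1: []}
    for x in data:
        # Assign each point to its nearest centroid
        nearest = min((0, 1), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters.values()]

print(centroids)  # final centroids near 1.5 and 8.83
```

The two groups emerge purely from the similarity (distance) structure of the data, with no labels given in advance; labeling the points with their final cluster index would be the automatic labeling step described above.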

Question answering (QA) 
Select relevant document portions to answer user’s queries formulated in natural language.



Recommender systems
A form of information filtering, by which interesting information items (e.g., songs, movies, or books) are presented to users based on their profile or their neighbors’ taste, neighborhood being defined by such aspects as geographical proximity, social acquaintance, or common interests.