Posted to user@mahout.apache.org by chirag lakhani <ch...@gmail.com> on 2015/01/05 23:28:12 UTC

example of hashing vectorizer for text data using mapreduce code

I am trying to do something similar to what was done in this Chimpler
example:

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/


If you have data like this:

tech    308215054011194110      Limited 3-Box $20 BOGO, Supreme $9 BOGO,

art     308215054011194118      Purchase The Jeopardy! Book by Alex Trebek

apparel 308215054011194146      #Shopping #Bargain #Deals Designer KATHY Van Zeeland



I would like to write map-reduce code that will take each record and
ultimately create a sequence file of Mahout vectors that can then be used
by the Naive Bayes algorithm.  I have not been able to find any examples of
this seemingly basic task online.  One thing that confuses me about
writing such code is how to call the Lucene analyzers and vectorizers so
that they behave consistently across map tasks.  Could someone provide
either an example of this online or some advice about how I would do such
a thing?  My understanding is that I would want the first column to be the
key and the vectorized form of the third column to be the value of this
sequence file.

Chimpler provides some code, but it appears to run against the local file
system rather than inside the map-reduce framework.

Chirag

Re: example of hashing vectorizer for text data using mapreduce code

Posted by chirag lakhani <ch...@gmail.com>.
I believe I may have found a solution to this problem, which I will
eventually try to put on GitHub, but now I am not sure how to run it on
the cluster.  I created the code in my Eclipse IDE as a Maven project and
then copied the jar file to the Hadoop cluster (vectorCode-1.0.jar).

I now try to run it as follows:

hadoop jar vectorCode-1.0.jar vectorCode.vectorMapReduce \
    hdfs://172.28.104.198/trainingFourColumns \
    hdfs://172.28.104.198/trainingMahoutVectors3


but I get the following error

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/VectorWritable
    at transactionCode.transactionMapReduce.main(transactionMapReduce.java:130)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)



It seems that the Mahout math library is not being included on the
classpath.  How would I include it in my MapReduce job?  Two options I am
considering are sketched below; can anyone confirm which is the right
approach?

Chirag
