Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/05/29 14:34:17 UTC

A hadoop novice meets mahout

OK, I've got some inputs, I want to run k-means, how do I feed the beast?

Re: A hadoop novice meets mahout

Posted by Benson Margulies <bi...@gmail.com>.

Oh, yikes. Please ignore my last message.

On Fri, May 29, 2009 at 12:28 PM, Benson Margulies <bi...@gmail.com> wrote:

> Jeff,
>
> The 'KMeans' job in SyntheticControl does not run KMeans. Presumably, the
> idea is to run Canopy (which it does) and then KMeans which it doesn't.
>
> Am I missing something?
>
> --benson
>
>
> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>
>> Benson Margulies wrote:
>>
>>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>>
>>>
>>>
>> Make sure you can run the Synthetic Control example to get everything
>> wired together correctly: JDK, Hadoop, Mahout. See
>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>> input job to convert your data similar to
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>> and make a new job like
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>> You will have a small adventure and then be operational.
>>
>> Have fun,
>> Jeff
>>
>
>

Re: A hadoop novice meets mahout

Posted by Benson Margulies <bi...@gmail.com>.

Jeff,

The 'KMeans' job in SyntheticControl does not run KMeans. Presumably, the
idea is to run Canopy (which it does) and then KMeans, which it doesn't.

Am I missing something?

--benson


On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:

> Benson Margulies wrote:
>
>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>
>>
>>
> Make sure you can run the Synthetic Control example to get everything wired
> together correctly: JDK, Hadoop, Mahout. See
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
> input job to convert your data similar to
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
> and make a new job like
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
> You will have a small adventure and then be operational.
>
> Have fun,
> Jeff
>

Re: A hadoop novice meets mahout

Posted by Benson Margulies <bi...@gmail.com>.

The Shashikant code ends up with a SparseVector. There must be some easy
way to pull in a SparseVector instead of a DenseVector. The
SparseVector reader wants a DataInput, and the InputMapper has a Text, but
perhaps a quick StringReader is all I need.
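
One caveat on the StringReader idea: java.io.StringReader is a Reader, not a
DataInput, so it would not satisfy a method that expects a DataInput. A
possible bridge, sketched below with JDK classes only, is to wrap the raw
bytes in a ByteArrayInputStream and a DataInputStream. The index/value wire
format and the class name SparseRoundTrip are hypothetical, chosen purely for
illustration; this is not Mahout's SparseVector serialization, and whether the
Text value actually holds a binary form like this would need checking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; not part of Mahout or Hadoop. */
public class SparseRoundTrip {

    /** Writes an assumed format: entry count, then (int index, double value) pairs. */
    static void write(Map<Integer, Double> vector, DataOutput out) throws IOException {
        out.writeInt(vector.size());
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            out.writeInt(entry.getKey());
            out.writeDouble(entry.getValue());
        }
    }

    /** Reads the same assumed format back from any DataInput. */
    static Map<Integer, Double> read(DataInput in) throws IOException {
        int count = in.readInt();
        Map<Integer, Double> vector = new TreeMap<>();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            double value = in.readDouble();
            vector.put(index, value);
        }
        return vector;
    }

    /** The bridge itself: a byte array presented as a DataInput. */
    static DataInput asDataInput(byte[] bytes) {
        return new DataInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Double> original = new TreeMap<>();
        original.put(3, 0.5);
        original.put(1042, 2.25);

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(original, new DataOutputStream(buffer));

        Map<Integer, Double> copy = read(asDataInput(buffer.toByteArray()));
        System.out.println("round trip equal: " + copy.equals(original));
    }
}
```

If the mapper's Text really carries a printable string rather than binary
data, this wrapper would not help and a small text parser would be the more
likely route.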

The code in the example

On Fri, May 29, 2009 at 12:00 PM, Grant Ingersoll <gs...@apache.org> wrote:

> I think Shashikant was using a modified form of Mahout that encoded the
> labels in the output.
>
> I think we're still a little way from having a utility that makes it truly
> straightforward to go from text to clusterable vectors.
>
> No doubt what is happening is the recognition of a need for some type of
> pipeline process that can work with multiple data sources and output various
> consumable formats and help select features.  Unfortunately, we aren't there
> just yet.
>
> -Grant
>
>
> On May 29, 2009, at 11:27 AM, Benson Margulies wrote:
>
>> I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
>> into data via TF-IDF. What comes out of there is not in the same format as
>> your example data. Does this mean that I need a different InputDriver? Is one
>> lying about for the format written by that DocumentVector class?
>>
>> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
>> <jd...@windwardsolutions.com> wrote:
>>
>>> Benson Margulies wrote:
>>>
>>>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>>>
>>>>
>>>>
>>> Make sure you can run the Synthetic Control example to get everything wired
>>> together correctly: JDK, Hadoop, Mahout. See
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>>> input job to convert your data similar to
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>>> and make a new job like
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>>> You will have a small adventure and then be operational.
>>>
>>> Have fun,
>>> Jeff
>>>
>>>
>

Re: A hadoop novice meets mahout

Posted by Grant Ingersoll <gs...@apache.org>.

I think Shashikant was using a modified form of Mahout that encoded the
labels in the output.

I think we're still a little way from having a utility that makes it truly
straightforward to go from text to clusterable vectors.

No doubt what is happening is the recognition of a need for some type of
pipeline process that can work with multiple data sources, output various
consumable formats, and help select features. Unfortunately, we aren't there
just yet.

-Grant

On May 29, 2009, at 11:27 AM, Benson Margulies wrote:

> I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
> into data via TF-IDF. What comes out of there is not in the same format as
> your example data. Does this mean that I need a different InputDriver? Is one
> lying about for the format written by that DocumentVector class?
>
> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>
>> Benson Margulies wrote:
>>
>>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>>
>>>
>>>
>> Make sure you can run the Synthetic Control example to get everything wired
>> together correctly: JDK, Hadoop, Mahout. See
>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>> input job to convert your data similar to
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>> and make a new job like
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>> You will have a small adventure and then be operational.
>>
>> Have fun,
>> Jeff
>>

Re: A hadoop novice meets mahout

Posted by Benson Margulies <bi...@gmail.com>.

I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
into data via TF-IDF. What comes out of there is not in the same format as
your example data. Does this mean that I need a different InputDriver? Is one
lying about for the format written by that DocumentVector class?

On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:

> Benson Margulies wrote:
>
>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>
>>
>>
> Make sure you can run the Synthetic Control example to get everything wired
> together correctly: JDK, Hadoop, Mahout. See
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
> input job to convert your data similar to
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
> and make a new job like
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
> You will have a small adventure and then be operational.
>
> Have fun,
> Jeff
>

Re: A hadoop novice meets mahout

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Benson Margulies wrote:
> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>
>   
Make sure you can run the Synthetic Control example to get everything 
wired together correctly: JDK, Hadoop, Mahout. See 
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an 
input job to convert your data similar to 
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java 
and make a new job like 
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java. 
You will have a small adventure and then be operational.

Have fun,
Jeff
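
For readers following this advice, the conversion step can be sketched
without any Hadoop or Mahout dependencies. This is only a minimal sketch,
assuming each input record is one line of whitespace-separated numbers (as
the Synthetic Control data appears to be). The class name LineToVector and
its parse method are hypothetical and are not taken from InputDriver.java;
in a real job the same parsing would presumably sit inside a mapper and
produce a Mahout vector rather than a plain array.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Mahout. */
public class LineToVector {

    /** Parses one line of whitespace-separated numbers into a dense array. */
    static double[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            // A blank line yields an empty vector rather than an error.
            return new double[0];
        }
        String[] tokens = trimmed.split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Throws NumberFormatException on a malformed token.
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] vector = parse("28.78  34.46 31.33");
        System.out.println(Arrays.toString(vector));
    }
}
```

Letting a malformed token fail loudly is a deliberate choice in this sketch:
silently skipping bad values would shift the remaining dimensions and distort
every distance computed on that point.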

Re: A hadoop novice meets mahout

Posted by Benson Margulies <bi...@gmail.com>.

I confess that I assumed that it displayed the aftermath. If it runs the
job, I apologize for being lazy and I'll go from there.

On Fri, May 29, 2009 at 9:41 AM, Lukáš Vlček <lu...@gmail.com> wrote:

> Hi,
> did you look at DisplayKMeans.java in Mahout examples?
> (
>
> http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/kmeans/DisplayKMeans.java?view=co
> )
>
> Lukas
>
> On Fri, May 29, 2009 at 2:34 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>
> > OK, I've got some inputs, I want to run k-means, how do I feed the beast?
> >
>
>
>
> --
> http://blog.lukas-vlcek.com/
>

Re: A hadoop novice meets mahout

Posted by Lukáš Vlček <lu...@gmail.com>.

Hi,
did you look at DisplayKMeans.java in Mahout examples?
(
http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/kmeans/DisplayKMeans.java?view=co
)

Lukas

On Fri, May 29, 2009 at 2:34 PM, Benson Margulies <bi...@gmail.com> wrote:

> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>

-- 
http://blog.lukas-vlcek.com/