Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/05/29 14:34:17 UTC
A hadoop novice meets mahout
OK, I've got some inputs, I want to run k-means, how do I feed the beast?
Re: A hadoop novice meets mahout
Posted by Benson Margulies <bi...@gmail.com>.
Oh, yikes. Please ignore my last message.
On Fri, May 29, 2009 at 12:28 PM, Benson Margulies <bi...@gmail.com> wrote:
> Jeff,
>
> The 'KMeans' job in SyntheticControl does not run KMeans. Presumably, the
> idea is to run Canopy (which it does) and then KMeans, which it doesn't.
>
> Am I missing something?
>
> --benson
>
>
> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman <jdog@windwardsolutions.com>
> wrote:
>
>> Benson Margulies wrote:
>>
>>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>>
>>>
>>>
>> Make sure you can run the Synthetic Control example to get everything
>> wired together correctly: JDK, Hadoop, Mahout. See
>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>> input job to convert your data similar to
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>> and make a new job like
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>> You will have a small adventure and then be operational.
>>
>> Have fun,
>> Jeff
>>
>
>
Re: A hadoop novice meets mahout
Posted by Benson Margulies <bi...@gmail.com>.
Jeff,
The 'KMeans' job in SyntheticControl does not run KMeans. Presumably, the
idea is to run Canopy (which it does) and then KMeans, which it doesn't.
Am I missing something?
--benson
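For readers unfamiliar with the second stage: k-means starts from a set of initial centers (in the Canopy-then-KMeans arrangement described above, presumably the canopy centers) and refines them iteratively. What follows is only a minimal in-memory sketch in plain Java, assuming Euclidean distance. The class KMeansSketch and its methods are hypothetical names made up for illustration; this is not Mahout's MapReduce implementation.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of Mahout. */
final class KMeansSketch {
    private KMeansSketch() {}

    /** Returns the index of the center closest to p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centers[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * Refines the initial centers until no center moves or maxIterations is reached.
     * Assumes at least one center and that all vectors share the same dimension.
     */
    static double[][] cluster(double[][] points, double[][] initialCenters, int maxIterations) {
        double[][] centers = new double[initialCenters.length][];
        for (int c = 0; c < centers.length; c++) {
            centers[c] = initialCenters[c].clone();
        }
        int dim = centers[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            // Assignment step: add each point to the running sum of its nearest center.
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each non-empty cluster's center to the mean of its points.
            boolean moved = false;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < dim; i++) {
                    double mean = sums[c][i] / counts[c];
                    if (mean != centers[c][i]) {
                        moved = true;
                    }
                    centers[c][i] = mean;
                }
            }
            if (!moved) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] seeds = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(cluster(points, seeds, 10)));
    }
}
```

The seeds are passed in rather than chosen inside cluster(), which mirrors the idea of a separate seeding stage feeding the refinement stage.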
On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Benson Margulies wrote:
>
>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>
>>
>>
> Make sure you can run the Synthetic Control example to get everything wired
> together correctly: JDK, Hadoop, Mahout. See
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
> input job to convert your data similar to
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
> and make a new job like
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
> You will have a small adventure and then be operational.
>
> Have fun,
> Jeff
>
Re: A hadoop novice meets mahout
Posted by Benson Margulies <bi...@gmail.com>.
The Shashikant code ends up with a SparseVector. There must be some easy
way to pull in a SparseVector instead of a DenseVector. The
SparseVector reader wants a DataInput, and the InputMapper has a Text, but
perhaps a quick StringReader is all I need.
The code in the example
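As a separate illustration (this is not code from the thread, from MAHOUT-126, or from Mahout): java.io.DataInput is a binary interface, whereas StringReader is a character Reader, so a StringReader cannot stand in for a DataInput. If the mapper already holds a line of text, one alternative is to parse that text directly. The sketch below assumes a made-up line format of whitespace-separated "index:value" pairs; the format, the class SparseLineParser, and its method are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; the "index:value" line format is an assumption. */
final class SparseLineParser {
    private SparseLineParser() {}

    /**
     * Parses a line such as "3:0.5 10:1.25" into a sorted index-to-value map.
     * Entries whose value is zero are dropped, since a sparse vector omits them.
     */
    static Map<Integer, Double> parse(String line) {
        Map<Integer, Double> entries = new TreeMap<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return entries;
        }
        for (String token : trimmed.split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon <= 0 || colon == token.length() - 1) {
                throw new IllegalArgumentException("Expected index:value, got: " + token);
            }
            int index = Integer.parseInt(token.substring(0, colon));
            double value = Double.parseDouble(token.substring(colon + 1));
            if (value != 0.0) {
                entries.put(index, value);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("3:0.5 10:1.25 7:0"));
    }
}
```

The resulting map could then be copied into whatever sparse vector class the clustering job expects; that last step depends on the Mahout version in use and is left out here.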
On Fri, May 29, 2009 at 12:00 PM, Grant Ingersoll <gs...@apache.org> wrote:
> I think Shashikant was using a modified form of Mahout that encoded the
> labels in the output.
>
> I think we're still a little bit away from having a utility that truly
> makes this straightforward to go from text to clusterable vectors.
>
> No doubt what is happening is the recognition of a need for some type of
> pipeline process that can work with multiple data sources and output various
> consumable formats and help select features. Unfortunately, we aren't there
> just yet.
>
> -Grant
>
>
> On May 29, 2009, at 11:27 AM, Benson Margulies wrote:
>
>> I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
>> into data via TF-IDF. What comes out of there is not in the same format as
>> your example data. This means that I need a different InputDriver? Is one
>> lying about for the format written by that DocumentVector class?
>>
>> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
>> <jd...@windwardsolutions.com> wrote:
>>
>>> Benson Margulies wrote:
>>>
>>>> OK, I've got some inputs, I want to run k-means, how do I feed the
>>>> beast?
>>>>
>>> Make sure you can run the Synthetic Control example to get everything
>>> wired together correctly: JDK, Hadoop, Mahout. See
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>>> input job to convert your data similar to
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>>> and make a new job like
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>>> You will have a small adventure and then be operational.
>>>
>>> Have fun,
>>> Jeff
>>>
>
Re: A hadoop novice meets mahout
Posted by Grant Ingersoll <gs...@apache.org>.
I think Shashikant was using a modified form of Mahout that encoded
the labels in the output.
I think we're still a little bit away from having a utility that truly
makes this straightforward to go from text to clusterable vectors.
No doubt what is happening is the recognition of a need for some type
of pipeline process that can work with multiple data sources and
output various consumable formats and help select features.
Unfortunately, we aren't there just yet.
-Grant
On May 29, 2009, at 11:27 AM, Benson Margulies wrote:
> I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
> into data via TF-IDF. What comes out of there is not in the same format as
> your example data. This means that I need a different InputDriver? Is one
> lying about for the format written by that DocumentVector class?
>
> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
>> Benson Margulies wrote:
>>
>>> OK, I've got some inputs, I want to run k-means, how do I feed the
>>> beast?
>>>
>>>
>>>
>> Make sure you can run the Synthetic Control example to get everything
>> wired together correctly: JDK, Hadoop, Mahout. See
>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>> input job to convert your data similar to
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>> and make a new job like
>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>> You will have a small adventure and then be operational.
>>
>> Have fun,
>> Jeff
>>
Re: A hadoop novice meets mahout
Posted by Benson Margulies <bi...@gmail.com>.
I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text
into data via TF-IDF. What comes out of there is not in the same format as
your example data. This means that I need a different InputDriver? Is one
lying about for the format written by that DocumentVector class?
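For context on what "turn text into data via TF-IDF" produces, here is a minimal sketch of one common TF-IDF variant (raw term count times log(N / document frequency)). It is not the MAHOUT-126 code, whose weighting and output format may differ; the class TfIdfSketch and its method are hypothetical, and documents are assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; one common TF-IDF variant, not MAHOUT-126. */
final class TfIdfSketch {
    private TfIdfSketch() {}

    /** Returns, for each tokenized document, a term-to-weight map. */
    static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // Document frequency: the number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: raw count of each term within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("a", "c"));
        System.out.println(weigh(docs));
    }
}
```

Each resulting map is naturally sparse (most vocabulary terms are absent from any one document), which is why the output of such a step tends to arrive as sparse rather than dense vectors.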
On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Benson Margulies wrote:
>
>> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>>
>>
>>
> Make sure you can run the Synthetic Control example to get everything wired
> together correctly: JDK, Hadoop, Mahout. See
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
> input job to convert your data similar to
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
> and make a new job like
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
> You will have a small adventure and then be operational.
>
> Have fun,
> Jeff
>
Re: A hadoop novice meets mahout
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Benson Margulies wrote:
> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>
>
Make sure you can run the Synthetic Control example to get everything
wired together correctly: JDK, Hadoop, Mahout. See
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
input job to convert your data similar to
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
and make a new job like
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
You will have a small adventure and then be operational.
Have fun,
Jeff
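The "input job to convert your data" step above boils down to turning each input record into a numeric vector. A minimal sketch of that conversion in plain Java follows, assuming one record per line with whitespace-separated numbers. The class DenseLineParser and its methods are hypothetical and are not the InputDriver referenced above; in a real job this logic would presumably sit inside a Hadoop mapper and emit Mahout vector objects rather than double[] arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not Mahout's InputDriver. */
final class DenseLineParser {
    private DenseLineParser() {}

    /** Parses one line of whitespace-separated numbers; a blank line yields an empty array. */
    static double[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        String[] tokens = trimmed.split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    /** Parses every non-blank line into a vector, preserving input order. */
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> vectors = new ArrayList<>();
        for (String line : lines) {
            double[] v = parseLine(line);
            if (v.length > 0) {
                vectors.add(v);
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("28.78 34.46  31.33", "", "1 2 3");
        for (double[] v : parseAll(lines)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

Malformed numbers surface as NumberFormatException rather than being skipped silently, so bad input is noticed early instead of distorting the clusters.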
Re: A hadoop novice meets mahout
Posted by Benson Margulies <bi...@gmail.com>.
I confess that I assumed that it displayed the aftermath. If it runs the
job, I apologize for being lazy and I'll go from there.
On Fri, May 29, 2009 at 9:41 AM, Lukáš Vlček <lu...@gmail.com> wrote:
> Hi,
> did you look at DisplayKMeans.java in Mahout examples?
> (
>
> http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/kmeans/DisplayKMeans.java?view=co
> )
>
> Lukas
>
> On Fri, May 29, 2009 at 2:34 PM, Benson Margulies <bimargulies@gmail.com>
> wrote:
>
> > OK, I've got some inputs, I want to run k-means, how do I feed the beast?
> >
>
>
>
> --
> http://blog.lukas-vlcek.com/
>
Re: A hadoop novice meets mahout
Posted by Lukáš Vlček <lu...@gmail.com>.
Hi,
did you look at DisplayKMeans.java in Mahout examples?
(
http://svn.apache.org/viewvc/lucene/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/kmeans/DisplayKMeans.java?view=co
)
Lukas
On Fri, May 29, 2009 at 2:34 PM, Benson Margulies <bi...@gmail.com> wrote:
> OK, I've got some inputs, I want to run k-means, how do I feed the beast?
>
--
http://blog.lukas-vlcek.com/