Posted to user@mahout.apache.org by Robert Stewart <bs...@gmail.com> on 2012/04/18 21:08:33 UTC

why so hard to get doc->cluster mapping?

I am running kmeans clustering on vectors extracted from a Lucene index.

What I want as my end result is a mapping from each document ID to its cluster.  How can I get that output?  I see many other people also want this, but none of the solutions I have found give enough detail for me to get it working.

So far I do this:

./mahout lucene.vector -d ~/clusterdemo/solr/data/index/ -f text --idField id --output output.txt --dictOut dict.txt

./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl

./mahout clusterdump --dictionary dict.txt --seqFileDir kmeans/clusters-10-final --dictionaryType text --pointsDir kmeans/clusteredPoints --output dump

But the "dump" file does not contain any mapping from document ID to cluster.  How can I get that?  It should not be this hard to get the most obvious/useful output IMO ;)

Thanks
Bob

Re: why so hard to get doc->cluster mapping?

Posted by Yuval Feinstein <yu...@citypath.com>.

Hi Robert.
I did something similar, but using seq2sparse.
I believe that once you preserve the document id when creating the input
file for kmeans, clusterdump will give you the document id as part of its
output.
I suggest that you try this flag when running ./mahout lucene.vector:

 --idField idField    The field in the index containing the index.  If null,
                      then the Lucene internal doc id is used which is prone
                      to error if the underlying index changes
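
To see why the help text warns about the internal doc id, here is a toy model (not Lucene code, just a sketch). It treats the internal id as a document's position in the index and shows that position shifting once an earlier document is deleted and the index is compacted, while the stored id field stays put. The field name "id" matches the command in the original post; the document values and the helper name are made up for illustration.

```python
def internal_ids(index):
    """Toy stand-in for internal doc ids: position in the index -> stored id field."""
    return {position: doc["id"] for position, doc in enumerate(index)}


# Hypothetical index contents; "id" plays the role of the --idField value.
index = [{"id": "A17"}, {"id": "B42"}, {"id": "C03"}]
before = internal_ids(index)

# Delete the first document and compact, roughly what a segment merge does.
index = [doc for doc in index if doc["id"] != "A17"]
after = internal_ids(index)

# Position 1 now refers to a different document than it did before.
print(before[1], "->", after[1])  # prints: B42 -> C03
```

A mapping keyed on the stored field keeps pointing at the same document after such a change, which is the reason to pass --idField.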

Good luck,
Yuval


On Sun, Apr 22, 2012 at 8:13 AM, Paritosh Ranjan <pr...@xebia.com> wrote:

> The job is kmeans, class is KMeansDriver. The input there is supposed to
> be an org.apache.hadoop.fs.Path containing a sequence file.
> I see a txt file being passed.
>
>
> ./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters
> clusters -cl

Re: why so hard to get doc->cluster mapping?

Posted by Paritosh Ranjan <pr...@xebia.com>.

The job is kmeans and the class is KMeansDriver. Its input is supposed to be an org.apache.hadoop.fs.Path pointing to a sequence file.
I see a txt file being passed:

./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl
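
One quick way to settle whether output.txt really is a sequence file is to look at its first bytes rather than its name. The sketch below assumes the usual Hadoop SequenceFile header, which starts with the three ASCII bytes SEQ followed by a version byte. The function name is made up for illustration, and the check only works on a local copy of the file.

```python
SEQ_MAGIC = b"SEQ"  # assumed: first three bytes of a Hadoop SequenceFile header


def looks_like_sequence_file(path):
    """Return True if the local file at `path` starts with the SequenceFile magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(SEQ_MAGIC)) == SEQ_MAGIC


# Hypothetical usage on a local copy of the kmeans input:
#   looks_like_sequence_file("output.txt")
```

A name ending in .txt says nothing either way, so a check like this would show what kmeans is actually being given.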


On 22-04-2012 04:55, Lance Norskog wrote:
> Nothing should require a local or HDFS path. Which job/class is this?

Re: why so hard to get doc->cluster mapping?

Posted by Lance Norskog <go...@gmail.com>.

Nothing should require a local or HDFS path. Which job/class is this?

On Fri, Apr 20, 2012 at 3:17 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> I am not sure about this; however, I see a txt file ( -i output.txt ) as the input
> to kmeans.  KMeans is supposed to take an HDFS Path as input.
> Ignore this if it is an HDFS path.



-- 
Lance Norskog
goksron@gmail.com

Re: why so hard to get doc->cluster mapping?

Posted by Paritosh Ranjan <pr...@xebia.com>.

I am not sure about this; however, I see a txt file ( -i output.txt ) as the
input to kmeans.  KMeans is supposed to take an HDFS Path as input.
Ignore this if it is an HDFS path.

On 19-04-2012 03:23, Jeff Eastman wrote:
> Are you running seq2sparse in there somewhere? It has a -nv option 
> that will produce NamedVectors in its vector output. These will pass 
> through the clustering and be evident in the clusterdump output.

Re: why so hard to get doc->cluster mapping?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Are you running seq2sparse in there somewhere? It has a -nv option that 
will produce NamedVectors in its vector output. These will pass through 
the clustering and be evident in the clusterdump output.
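
Once the names do show up in the clusterdump text, turning it into a doc->cluster mapping is a small parsing job. The sketch below is only that, a sketch: it assumes cluster header lines that start with VL-<n>{ or CL-<n>{ and point lines of the form "weight: name = [ ... ]". Both patterns, the function name and the sample text are assumptions made for illustration, so adjust them to whatever your dump actually contains.

```python
import re

# Assumed line shapes; adjust both patterns to match the real dump file.
CLUSTER_LINE = re.compile(r"^:?((?:VL|CL)-\d+)\{")
POINT_LINE = re.compile(r"^\s*[\d.]+:\s*(\S+)\s*=\s*\[")


def doc_to_cluster(lines):
    """Map each named point to the cluster header it appears under."""
    mapping = {}
    current = None
    for line in lines:
        header = CLUSTER_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        point = POINT_LINE.match(line)
        if point and current is not None:
            mapping[point.group(1)] = current
    return mapping


# Made-up sample in the assumed format, not real Mahout output.
SAMPLE = """\
VL-3{n=2 c=[apple:1.200] r=[apple:0.100]}
\tTop Terms:
\t\tapple => 1.2
\tWeight:  Point:
\t1.0: doc1 = [apple:1.000]
\t1.0: doc7 = [apple:1.400]
VL-8{n=1 c=[pear:2.000] r=[]}
\tWeight:  Point:
\t1.0: doc2 = [pear:2.000]
"""

if __name__ == "__main__":
    for doc_id, cluster_id in sorted(doc_to_cluster(SAMPLE.splitlines()).items()):
        print(f"{doc_id}\t{cluster_id}")
```

If the vectors are not named, point lines have no "name =" part and nothing matches, so an empty result is a hint that the ids were lost before clustering.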
