Posted to user@mahout.apache.org by Phoenix Bai <ba...@gmail.com> on 2012/08/27 08:49:02 UTC

does seq2sparse or kmeans filter data ? I am losing data!

Hi All,

Good afternoon.
I ran the following three steps and got the clustered data I expected.

My input data is 1124 objects (in key:value format). However, in the
output, I received only 491 objects.
What happened to the other 1124-491=633 objects?

I checked the options of seq2sparse, kmeans, and clusterdump, but I failed
to find any option that indicates it is used to filter out data.
So, I wonder: in which step did the data go missing? Is it ignored or
just filtered? When and how?
Is there any way that I could stop the filtering process?

Step1:
mahout seq2sparse \
-i /group/tbdev/zhimo.bmz/mahout/data/videotags_seq \
-o /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors \
-ow \
-a org.apache.lucene.analysis.WhitespaceAnalyzer \
-wt tfidf \
-x 90 \
-seq \
-n 2 \
-nv

Step2:
mahout kmeans \
-i /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/tfidf-vectors/ \
-c /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/videotags-initial-clusters \
-o /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-cd 0.5 \
-k 150 \
-x 20 \
-cl \
-ow

Step3:
mahout clusterdump \
--seqFileDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusters-2/ \
--pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000 \
--output /home/zhimo.bmz/work/videotagsAnalyze_15.txt \
-d /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/dictionary.file-0 \
-dt sequencefile


Please help!

Re: does seq2sparse or kmeans filter data ? I am losing data!

Posted by Phoenix Bai <ba...@gmail.com>.

Hi Jeff,

I found the cause.
seq2sparse has a minSupport option, which defaults to 2. It was filtering
out most of the objects, namely those whose feature terms have a frequency
of less than 2.

Setting minSupport to 1 solved the problem.
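
To illustrate how a minimum term count can make whole objects disappear, here
is a minimal Python sketch. It is not Mahout's code: it assumes that minSupport
acts as a corpus-wide minimum term count and that an object left with no terms
is not written out. The function name and the sample data are hypothetical.

```python
from collections import Counter


def prune_by_min_support(docs, min_support):
    """Drop terms whose corpus-wide count is below min_support, then drop
    documents that are left with no terms at all (assumed behaviour)."""
    # Count every term occurrence across the whole corpus.
    term_counts = Counter(term for terms in docs.values() for term in terms)
    kept = {}
    for key, terms in docs.items():
        surviving = [t for t in terms if term_counts[t] >= min_support]
        if surviving:  # a document with no surviving terms would be an empty vector
            kept[key] = surviving
    return kept


# Hypothetical key:value input: object id -> whitespace-separated tags.
docs = {
    "v1": "cat funny".split(),
    "v2": "cat music".split(),
    "v3": "rare_tag_a".split(),
    "v4": "rare_tag_b".split(),
}

print(len(prune_by_min_support(docs, 2)))  # objects kept with min_support=2
print(len(prune_by_min_support(docs, 1)))  # objects kept with min_support=1
```

In this toy input, objects tagged only with terms that occur once vanish when
the cutoff is 2 and come back when it is 1, which is the same pattern as the
1124 versus 491 counts reported above.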

Thanks

On Tue, Aug 28, 2012 at 8:20 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> No idea at this point. K-means is actually an unsupervised classification
> algorithm and I am certain that it does not remove any of the offered
> points for training purposes. The ClusterClassificationDriver that is
> invoked to classify the offered points does accept a
> clusterClassificationThreshold, however, and that can be used for
> outlier removal. If, for some reason, the distances to the final clusters
> were less than this threshold (default == 0) then points would be removed.
> But I don't see how this could occur and you are the only user to report
> such behavior.
>
> Given that you have a pretty small dataset, you might try running the
> k-means step in sequential mode. This would allow you to verify the results
> while running in memory and also allow you to use a debugger to investigate
> further. By setting a breakpoint in ClusterClassificationDriver.shouldClassify()
> (you'd need to edit it a bit first) you could determine if this was
> removing any of your input points.
>
>
>
> On 8/27/12 10:26 PM, Phoenix Bai wrote:
>
>> Hi Jeff,
>>
>> first of all, thank you for your response.
>>
>> But unfortunately, I don't think that is the cause. As I checked, there is
>> only one file, part-m-00000, under the clusteredPoints directory.
>>
>> $ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
>> Found 1 items
>> -rw-r-----   3 bmz dev      24608 2012-08-27 10:27
>> /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>>
>> So, what else could it be?
>> btw, since kmeans belongs to supervised learning, is it possible that it
>> takes out some data to construct a training dataset?
>> Just a guess, and it seems unreasonable to do that.
>>
>> Thanks
>>
>> On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman
>> <jd...@windwardsolutions.com> wrote:
>>
>>> Offhand, I wonder why you are specifying only a single part-m-00000 file
>>> in your clusterdump step? If there is more than one part file (a usual
>>> case) then you might be missing some of the clustered points. If so, then
>>> using the directory instead might help:
>>>
>>> --pointsDir
>>> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
>>> \
>>>
>>> On 8/27/12 2:49 AM, Phoenix Bai wrote:
>>>
>>>> --pointsDir
>>>> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>>>> \
>>>>
>>>
>

Re: does seq2sparse or kmeans filter data ? I am losing data!

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

No idea at this point. K-means is actually an unsupervised 
classification algorithm and I am certain that it does not remove any of 
the offered points for training purposes. The 
ClusterClassificationDriver that is invoked to classify the offered 
points does accept a clusterClassificationThreshold, however, and that 
can be used for outlier removal. If, for some reason, the distances to 
the final clusters were less than this threshold (default == 0) then 
points would be removed. But I don't see how this could occur and you 
are the only user to report such behavior.
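
The threshold check described above can be pictured with a small Python
sketch. It is an illustration only, not the Mahout source: it assumes each
point gets one non-negative score per cluster (higher means a better fit) and
is kept when its best score reaches the threshold. The function names, the
scores, and the 0.3 threshold are hypothetical.

```python
def should_classify(scores, threshold=0.0):
    """Keep a point only if its best per-cluster score reaches the threshold."""
    return max(scores) >= threshold


def classified(points, threshold=0.0):
    """Return the names of the points that pass the threshold check."""
    return [name for name, scores in points.items()
            if should_classify(scores, threshold)]


# Hypothetical per-cluster scores for three points (higher = better fit).
points = {
    "p1": [0.70, 0.20, 0.10],
    "p2": [0.05, 0.03, 0.02],
    "p3": [0.40, 0.35, 0.25],
}

print(classified(points))       # default threshold of 0
print(classified(points, 0.3))  # a stricter, hypothetical threshold
```

With a threshold of 0 and non-negative scores, no point is dropped in this
sketch, which fits the expectation that the default setting should not remove
any input points.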

Given that you have a pretty small dataset, you might try running the 
k-means step in sequential mode. This would allow you to verify the 
results while running in memory and also allow you to use a debugger to 
investigate further. By setting a breakpoint in 
ClusterClassificationDriver.shouldClassify() (you'd need to edit it a 
bit first) you could determine if this was removing any of your input 
points.

On 8/27/12 10:26 PM, Phoenix Bai wrote:
> Hi Jeff,
>
> first of all, thank you for your response.
>
> But unfortunately, I don't think that is the cause. As I checked, there is
> only one file, part-m-00000, under the clusteredPoints directory.
>
> $ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
> Found 1 items
> -rw-r-----   3 bmz dev      24608 2012-08-27 10:27
> /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>
> So, what else could it be?
> btw, since kmeans belongs to supervised learning, is it possible that it
> takes out some data to construct a training dataset?
> Just a guess, and it seems unreasonable to do that.
>
> Thanks
>
> On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>
>> Offhand, I wonder why you are specifying only a single part-m-00000 file
>> in your clusterdump step? If there is more than one part file (a usual
>> case) then you might be missing some of the clustered points. If so, then
>> using the directory instead might help:
>>
>> --pointsDir
>> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
>> \
>>
>>
>>
>>
>> On 8/27/12 2:49 AM, Phoenix Bai wrote:
>>
>>> --pointsDir
>>> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>>> \
>>>
>>

Re: does seq2sparse or kmeans filter data ? I am losing data!

Posted by Phoenix Bai <ba...@gmail.com>.

Hi Jeff,

first of all, thank you for your response.

But unfortunately, I don't think that is the cause. As I checked, there is
only one file, part-m-00000, under the clusteredPoints directory.

$ hadoop fs -ls /bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
Found 1 items
-rw-r-----   3 bmz dev      24608 2012-08-27 10:27
/bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000

So, what else could it be?
btw, since kmeans belongs to supervised learning, is it possible that it
takes out some data to construct a training dataset?
Just a guess, and it seems unreasonable to do that.

Thanks

On Tue, Aug 28, 2012 at 12:39 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:

> Offhand, I wonder why you are specifying only a single part-m-00000 file
> in your clusterdump step? If there is more than one part file (a usual
> case) then you might be missing some of the clustered points. If so, then
> using the directory instead might help:
>
> --pointsDir
> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints
> \
>
>
>
>
> On 8/27/12 2:49 AM, Phoenix Bai wrote:
>
>> --pointsDir
>> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
>> \
>>
>
>

Re: does seq2sparse or kmeans filter data ? I am losing data!

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Offhand, I wonder why you are specifying only a single part-m-00000 file 
in your clusterdump step? If there is more than one part file (a usual 
case) then you might be missing some of the clustered points. If so, 
then using the directory instead might help:

--pointsDir
/group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints \

On 8/27/12 2:49 AM, Phoenix Bai wrote:
> --pointsDir
> /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000
> \