Posted to user@mahout.apache.org by gaurav redkar <ga...@gmail.com> on 2012/01/06 12:48:18 UTC

Help regarding ClusterOutputPostProcessor

Hello,

When I ran the ClusterOutputPostProcessor on synthetic_control_data in
mapreduce mode, I observed that one directory contained points belonging to
2 other clusters, and the directories for those 2 clusters were not
created: their "part-*" files were empty, so the function
"movePartFilesToRespectiveDirectories()" was not able to create the
directories to put them into. I have converted the sequence file containing
the points belonging to those 3 clusters into a text file (by changing the
output format to TextOutputFormat). Kindly find the attached part-file,
which can now be viewed as text.

Any suggestions as to why this might be happening...?

Note: The program runs fine in sequential mode.

Thanks.

Re: Help regarding ClusterOutputPostProcessor

Posted by Lance Norskog <go...@gmail.com>.
Apache mail throws away all attachments.

If you think that this is a bug, please file a JIRA. If you can change
ClusterOutputPostProcessorTest to test for this scenario, please
contribute it. With such a test it is possible to single-step map-reduce
jobs inside your IDE. Sometimes these directory-manipulation problems are
hard to find.
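
For what it's worth, a minimal sketch of running the post-processor under
Hadoop's local job runner so it can be breakpointed in an IDE. The
run(Path, Path, boolean) entry point is the Mahout 0.6/0.7 signature as I
recall it, and the paths are placeholders; check your version.

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver;

    // Sketch only. With no cluster configuration on the classpath, Hadoop
    // falls back to its in-process local job runner, so breakpoints set in
    // the post-processor's mapper are hit when this runs inside an IDE.
    public class DebugPostProcessorLocally {
      public static void main(String[] args) throws Exception {
        Path clusterOutput = new Path("target/clustering-output"); // placeholder
        Path postProcessed = new Path("target/clusterpp-output");  // placeholder
        // false = run the map-reduce code path (assumed third argument)
        ClusterOutputPostProcessorDriver.run(clusterOutput, postProcessed, false);
      }
    }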

Lance


-- 
Lance Norskog
goksron@gmail.com

Re: Help regarding ClusterOutputPostProcessor

Posted by praneet mhatre <pr...@gmail.com>.
Great, that helps! I'll just go ahead with the output file then and see
what kind of results I get.

Thank you!

-- 
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine

Re: Help regarding ClusterOutputPostProcessor

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I think the answer to this question lies in how Dirichlet works: During 
each iteration, all points are assigned to clusters based upon a 
probabilistic assignment using a multinomial sampling of the cluster 
pdfs times a Dirichlet distribution mixture (see 
DirichletClusteringPolicy.select() for exact details). The value of "n" 
in each cluster in a clusters-i directory is the number of points that 
were assigned to it during the i-th iteration. If you are running the 
postprocessor over the last iteration, then "n" would be the number of 
points assigned to it during the last iteration only.

OTOH, when the ClusterOutputPostprocessor computes cluster assignments 
for each vector, it assigns only the cluster with the maximum pdf (the 
most likely cluster). Since each point is likely to be assigned to 
several of the possible clusters during the iterations it is not likely 
that "n" will ever agree with the COP assignment.



Re: Help regarding ClusterOutputPostProcessor

Posted by Paritosh Ranjan <pr...@xebia.com>.
To answer:

"I was wondering if the clusteredPoints directory contains the correct
point assignment and if I could just use that for the purpose of my
project."

I would say "Yes".

If you read the comments in the issue, you will find that

"The number of members printed by the clusterdumper code match the
number of points generated by the ClusterOutputPostProcessor for each
cluster. Sadly this number does not match the value 'n' for that cluster
in the clusterdumper implementation."

So the bug is most probably in the value of "n". Other people have faced
it too: http://comments.gmane.org/gmane.comp.apache.mahout.user/10906.

So, go ahead with the clusteredPoints.


Re: Help regarding ClusterOutputPostProcessor

Posted by praneet mhatre <pr...@gmail.com>.
Hi,

I had a look at the JIRA and it looks like the issue is still unresolved. I
wanted to know whether the suggestion that the postprocessor may be at fault
has been verified.

I am using Dirichlet clustering for a project of mine and I also noticed
the mismatch between the number of points actually present in the cluster
and the value of n. I was wondering if the clusteredPoints directory
contains the correct point assignment and if I could just use that for the
purpose of my project.

Thanks!


-- 
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine

Re: Help regarding ClusterOutputPostProcessor

Posted by gaurav redkar <ga...@gmail.com>.
Hello. As Jeff mentioned, I created a JIRA issue. Kindly check out
MAHOUT-966 <https://issues.apache.org/jira/browse/MAHOUT-966> and share
your inputs.

Thanks,
Gaurav


Re: Help regarding ClusterOutputPostProcessor

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Mean Shift accumulates the pointIds of every point assigned to a 
cluster, so I would expect n= to be correct in the cluster dumper 
output. It is most likely the postprocessor is misbehaving. Please 
create a JIRA and attach your dataset and we will take a look at it.

It would also be useful for you to include the exact CLI commands which 
you used to duplicate this problem.
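
One way to cross-check those counts is to read the clusteredPoints part
files directly and tally records per cluster id. A minimal sketch, with
the value class loaded reflectively so no Mahout class names are assumed,
the key taken to be the cluster id, and the path passed as an argument:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // Counts records per key in one clusteredPoints part file so the totals
    // can be compared with the n= values printed by clusterdumper.
    public class ClusteredPointsCounter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path part = new Path(args[0]); // e.g. .../clusteredPoints/part-m-00000
        FileSystem fs = part.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        Map<String, Integer> counts = new HashMap<String, Integer>();
        while (reader.next(key, value)) {
          String clusterId = key.toString();
          Integer c = counts.get(clusterId);
          counts.put(clusterId, c == null ? 1 : c + 1);
        }
        reader.close();
        System.out.println(counts); // cluster id -> number of points
      }
    }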


Re: Help regarding ClusterOutputPostProcessor

Posted by gaurav redkar <ga...@gmail.com>.
Hello,

I was able to rectify the afore-mentioned problem after I implemented a
custom partitioner instead of using the default hash partitioner. I have
another issue though: after running the postprocessor, the number of points
each cluster contains does not match the number of points each cluster
should contain as stated by clusterdumper.
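
A partitioner along these lines might look like the following sketch
(hypothetical class name and key/value types, not the actual code used
here); it would be registered on the job with
job.setPartitionerClass(ClusterIdPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical sketch: route records by cluster id instead of by hash
    // code, so every record for a given cluster lands in the same reducer
    // and therefore in the same part file.
    public class ClusterIdPartitioner extends Partitioner<IntWritable, Writable> {
      @Override
      public int getPartition(IntWritable clusterId, Writable value, int numPartitions) {
        // Non-negative modulo keeps the partition index in range.
        return (clusterId.get() & Integer.MAX_VALUE) % numPartitions;
      }
    }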


MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}

MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
The n mentioned in clusters-n-final against each cluster is different from
the number of points actually contained in the directory for each cluster.
Any idea why this is happening?

PS: the dataset on which I tested the algorithm has 1000 records with 200
attributes per record. I can share the dataset I used if needed.

Thanks,

Gaurav


Re: Help regarding ClusterOutputPostProcessor

Posted by Paritosh Ranjan <pr...@xebia.com>.
ClusterOutputPostProcessorDriver has options to run either sequentially or
in a mapreduce way.

If the clustering was done sequentially, then the ClusterOutputPostProcessor
should be run sequentially, and if the clustering was done in a mapreduce
way, then run the ClusterOutputPostProcessor with option mapreduce=true.

If you have already tried this and it's still not working, then filing a
bug (as Lance mentioned) would be appropriate.
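
A minimal sketch of driving the two modes from Java (the
run(Path, Path, boolean) signature and the topdown.postprocessor package
are how I recall Mahout 0.6/0.7, and the paths are examples; check your
version's ClusterOutputPostProcessorDriver for the exact entry point):

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver;

    public class PostProcessExample {
      public static void main(String[] args) throws Exception {
        Path clusterOutput = new Path("output");           // clustering job output
        Path postProcessed = new Path("output/clusterpp"); // per-cluster directories

        // Clustering was run sequentially -> post-process sequentially.
        ClusterOutputPostProcessorDriver.run(clusterOutput, postProcessed, true);

        // Clustering was run as map-reduce -> post-process as map-reduce:
        // ClusterOutputPostProcessorDriver.run(clusterOutput, postProcessed, false);
      }
    }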
