Posted to user@mahout.apache.org by Paritosh Ranjan <pr...@xebia.com> on 2011/10/01 20:57:10 UTC

Difference in results : Clustering : sequential and MapReduce

Hi,

I am able to cluster correctly sequentially, using CanopyDriver.

However, the same dataset does not cluster correctly when processed as
a MapReduce job with t1 = t3, t2 = t4 and t1 > t2. I am getting errors
like "Canopies are empty".

I also tried reducing the values of t3 and t4, but that either has no
effect or gives meaningless results.

Am I doing something wrong, or is there a bug somewhere?

I feel that the sequential and MapReduce runs should give similar
results, but that is not happening.

Thanks and Regards,
Paritosh

RE: Difference in results : Clustering : sequential and MapReduce

Posted by Jeff Eastman <je...@Narus.com>.
The sequential and mapreduce implementations do not produce the same results, as the sequential implementation runs canopy once and the mapreduce implementation twice: in each mapper and in the reducer. This is documented in https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering (see #10).
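
As a rough illustration of the point above (not Mahout's actual code), the sketch below runs a simplified canopy pass over 1-D points once, and then as two passes: one per hypothetical mapper split, followed by one over the emitted centers. The class and method names are made up for this example. Only the T2 threshold is modelled, and each canopy keeps its first point as its center instead of a computed centroid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CanopyPassesSketch {

    // One simplified canopy pass: a point starts a new canopy unless it
    // lies within t2 of an existing canopy center.
    static List<Double> canopyCenters(List<Double> points, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double p : points) {
            boolean stronglyBound = false;
            for (double c : centers) {
                if (Math.abs(p - c) < t2) {
                    stronglyBound = true;
                    break;
                }
            }
            if (!stronglyBound) {
                centers.add(p);
            }
        }
        return centers;
    }

    // Two passes: one per (hypothetical) mapper split, then one more
    // over all centers emitted by the splits, standing in for the reducer.
    static List<Double> twoPassCenters(List<List<Double>> splits, double t2) {
        List<Double> emitted = new ArrayList<>();
        for (List<Double> split : splits) {
            emitted.addAll(canopyCenters(split, t2));
        }
        return canopyCenters(emitted, t2);
    }

    public static void main(String[] args) {
        double t2 = 4.0;
        // Single pass over all points.
        System.out.println(canopyCenters(Arrays.asList(0.0, 3.0, 6.0), t2));
        // Same points, spread over two splits.
        System.out.println(twoPassCenters(
            Arrays.asList(Arrays.asList(3.0), Arrays.asList(0.0, 6.0)), t2));
    }
}
```

With these inputs the single pass keeps two canopies ([0.0, 6.0]) while the two-pass run collapses to one ([3.0]), which is the kind of divergence discussed in this thread.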

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Sunday, October 02, 2011 8:59 PM
To: user@mahout.apache.org
Subject: Re: Difference in results : Clustering : sequential and MapReduce

The sequential algorithm finds more/better clusters  than the mapreduce one.
There's not a huge difference, but the standalone one is better for sure.

Thanks and Regards,
Paritosh



Re: Difference in results : Clustering : sequential and MapReduce

Posted by Paritosh Ranjan <pr...@xebia.com>.
Yes, hierarchical clustering would be a good way to solve it. I will try
it out. Should I create a JIRA issue for it, or just provide the patch
when I am done?

On 03-10-2011 22:56, Jeff Eastman wrote:
> Well, the default clusterFilter == 0, so this is not the difference between the implementations. When you talk about distributing similar vectors to each mapper, you are really moving into a hierarchical clustering method where you cluster your input points into a few large clusters and then cluster each cluster subset again. This can be done with scripting of any clustering algorithm and might be effective with canopy.


RE: Difference in results : Clustering : sequential and MapReduce

Posted by Jeff Eastman <je...@Narus.com>.
Well, the default clusterFilter == 0, so this is not the difference between the implementations. When you talk about distributing similar vectors to each mapper, you are really moving into a hierarchical clustering method where you cluster your input points into a few large clusters and then cluster each cluster subset again. This can be done with scripting of any clustering algorithm and might be effective with canopy.
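
A minimal sketch of the two-level idea described above, assuming 1-D points and a simplified canopy pass in place of any real Mahout driver. All names here (TwoLevelCanopySketch, coarseT2, fineT2) are hypothetical. A coarse pass with a large threshold finds a few big groups, each point is assigned to its nearest coarse center, and a finer pass then runs inside each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoLevelCanopySketch {

    // Simplified canopy pass: a point becomes a new center unless it is
    // within t2 of an existing center.
    static List<Double> canopyCenters(List<Double> points, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double p : points) {
            boolean bound = false;
            for (double c : centers) {
                if (Math.abs(p - c) < t2) {
                    bound = true;
                    break;
                }
            }
            if (!bound) {
                centers.add(p);
            }
        }
        return centers;
    }

    // Center closest to p; centers must be non-empty.
    static double nearestCenter(double p, List<Double> centers) {
        double best = centers.get(0);
        for (double c : centers) {
            if (Math.abs(p - c) < Math.abs(p - best)) {
                best = c;
            }
        }
        return best;
    }

    // Coarse pass, then a fine pass inside each coarse subset.
    // Returns coarse center -> fine centers found in its subset.
    static Map<Double, List<Double>> twoLevel(List<Double> points, double coarseT2, double fineT2) {
        List<Double> coarse = canopyCenters(points, coarseT2);
        Map<Double, List<Double>> subsets = new LinkedHashMap<>();
        for (double c : coarse) {
            subsets.put(c, new ArrayList<Double>());
        }
        for (double p : points) {
            subsets.get(nearestCenter(p, coarse)).add(p);
        }
        Map<Double, List<Double>> fine = new LinkedHashMap<>();
        for (Map.Entry<Double, List<Double>> e : subsets.entrySet()) {
            fine.put(e.getKey(), canopyCenters(e.getValue(), fineT2));
        }
        return fine;
    }

    public static void main(String[] args) {
        List<Double> points = Arrays.asList(0.0, 1.0, 2.0, 10.0, 11.0, 12.0);
        System.out.println(twoLevel(points, 5.0, 1.5));
    }
}
```

In a distributed setting, each coarse subset could in principle be handed to its own worker, so that similar vectors are clustered together; whether that helps with canopy in practice is left open here, as in the message above.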

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Sunday, October 02, 2011 10:56 PM
To: user@mahout.apache.org
Subject: Re: Difference in results : Clustering : sequential and MapReduce

I got the reason for difference.
Actually, its due to

if (canopy.getNumPoints()>  clusterFilter)


in CanopyMapper.

Similar data is not distributed evenly in the mappers. So, the canopies 
might come out with points < clusterFilter which are not processed further.
But, this check is a great performance enhancer. I have experienced that.

Maybe, distributing similar vectors on mappers might help to attain both 
quality and performance.




Re: Difference in results : Clustering : sequential and MapReduce

Posted by Paritosh Ranjan <pr...@xebia.com>.
I am implementing functionality to route similar records to the same
nodes (mappers). This will work like a preselection step and so will
let us use the clusterFilter, which is a great performance enhancer,
without any decrease in quality.

I will try to provide a patch for this.

On 03-10-2011 11:26, Paritosh Ranjan wrote:
> I got the reason for difference.
> Actually, its due to
>
> if (canopy.getNumPoints()>  clusterFilter)
>
>
> in CanopyMapper.
>
> Similar data is not distributed evenly in the mappers. So, the 
> canopies might come out with points < clusterFilter which are not 
> processed further.
> But, this check is a great performance enhancer. I have experienced that.
>
> Maybe, distributing similar vectors on mappers might help to attain 
> both quality and performance.


Re: Difference in results : Clustering : sequential and MapReduce

Posted by Paritosh Ranjan <pr...@xebia.com>.
I found the reason for the difference.
It is due to the check

if (canopy.getNumPoints() > clusterFilter)

in CanopyMapper.

Similar data is not distributed evenly across the mappers, so canopies
might come out with too few points to pass this check and are then not
processed further.
But this check is a great performance enhancer. I have experienced that.

Maybe distributing similar vectors to the same mappers would help to
attain both quality and performance.
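
To make the effect concrete, here is a small sketch under stated assumptions: 1-D points, a simplified canopy pass, and a non-zero clusterFilter. The class and method names are invented for illustration and are not Mahout's. It counts how many canopies pass a numPoints > clusterFilter check when the data is seen in one pass versus when the same points are spread over two splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClusterFilterSketch {

    // Sizes of the canopies found by one simplified pass over 1-D points.
    // A point joins the first canopy whose center is within t2,
    // otherwise it starts a new canopy.
    static List<Integer> canopySizes(List<Double> points, double t2) {
        List<Double> centers = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (double p : points) {
            int hit = -1;
            for (int i = 0; i < centers.size() && hit < 0; i++) {
                if (Math.abs(p - centers.get(i)) < t2) {
                    hit = i;
                }
            }
            if (hit < 0) {
                centers.add(p);
                sizes.add(1);
            } else {
                sizes.set(hit, sizes.get(hit) + 1);
            }
        }
        return sizes;
    }

    // Number of canopies that pass a numPoints > clusterFilter check.
    static int surviving(List<Double> points, double t2, int clusterFilter) {
        int kept = 0;
        for (int numPoints : canopySizes(points, t2)) {
            if (numPoints > clusterFilter) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double t2 = 1.0;
        int clusterFilter = 2;
        // One pass over all four points: a single canopy of size 4 survives.
        System.out.println(surviving(Arrays.asList(0.0, 0.1, 0.2, 0.3), t2, clusterFilter));
        // Same points over two splits: each split has a canopy of size 2, both dropped.
        System.out.println(surviving(Arrays.asList(0.0, 0.2), t2, clusterFilter)
            + surviving(Arrays.asList(0.1, 0.3), t2, clusterFilter));
    }
}
```

The first line printed is 1 and the second is 0: a group that is large enough overall can fall below the filter once it is divided among splits.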


On 03-10-2011 09:29, Paritosh Ranjan wrote:
> The sequential algorithm finds more/better clusters  than the 
> mapreduce one.
> There's not a huge difference, but the standalone one is better for sure.
>
> Thanks and Regards,
> Paritosh


Re: Difference in results : Clustering : sequential and MapReduce

Posted by Paritosh Ranjan <pr...@xebia.com>.
The sequential algorithm finds more/better clusters than the MapReduce one.
There's not a huge difference, but the standalone one is better for sure.

Thanks and Regards,
Paritosh

On 03-10-2011 01:47, Konstantin Shmakov wrote:
> I'd assume that distributed and sequential algorithms shouldn't produce
> identical results. To start with, they differ in initial setup:
> -- In distributed algorithm each mapper deals with subset of data and starts
> by picking up a random point, so N random points are picked up by N mappers
> to start with.
> -- In sequential algorithm 1 mapper deals with all data and starts by
> picking up 1 random point.
> But for the data with real clusters both algorithms should produce similar
> results.  How different are the results in your case?
>
> Thanks
> --Konstantin


Re: Difference in results : Clustering : sequential and MapReduce

Posted by Konstantin Shmakov <ks...@gmail.com>.
I'd assume that the distributed and sequential algorithms shouldn't produce
identical results. To start with, they differ in initial setup:
-- In the distributed algorithm, each mapper deals with a subset of the data
and starts by picking a random point, so N random points are picked by N
mappers to start with.
-- In the sequential algorithm, one mapper deals with all the data and starts
by picking one random point.
But for data with real clusters, both algorithms should produce similar
results.  How different are the results in your case?

Thanks
--Konstantin

On Sun, Oct 2, 2011 at 1:36 AM, Paritosh Ranjan <pr...@xebia.com> wrote:

> Even run() of CanopyDriver, which takes only T1 and T2 is giving different
> results for sequential and mapreduce.
> This is preventing me from scaling up, as I need to run mapreduce on hadoop
> to scale.
>
> Is anyone having any idea of this problem?


-- 
ksh:

CanopyDriver : run : clusterFilter : bug

Posted by Paritosh Ranjan <pr...@xebia.com>.
The new clusterFilter parameter in CanopyDriver's run method is not
working properly.

The cause is the if condition in ClusterMapper's findClosestCanopy
method:

protected Canopy findClosestCanopy(Vector point, Iterable<Canopy> canopies) {
  ...
  // find closest canopy
  for (Canopy canopy : canopies) {
    double dist = measure.distance(canopy.getCenter().getLengthSquared(), canopy.getCenter(), point);
    if (dist < minDist) {
      ...
    }
  }
}

The condition should be replaced with

if (dist < minDist && dist <= t1)

Otherwise, all records get the same canopy.

This fix also needs some null checks, because a point may then have no
closest canopy. I have made the change and got it working. I will try
to provide a patch with a test case that reproduces the issue.
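
The proposed condition can be sketched in isolation as follows. This is a simplified stand-in, assuming 1-D centers and absolute difference as the distance; the class ClosestCanopySketch and its findClosest method are hypothetical and do not reproduce Mahout's signatures. It returns -1 when no center lies within t1, which is the case that would need the extra null handling mentioned above.

```java
class ClosestCanopySketch {

    // Index of the nearest center that is also within t1 of the point,
    // or -1 when no center qualifies.
    static int findClosest(double point, double[] centers, double t1) {
        int best = -1;
        double minDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dist = Math.abs(point - centers[i]);
            // The extra "dist <= t1" bound is the change proposed in this message.
            if (dist < minDist && dist <= t1) {
                minDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] centers = {0.0, 10.0};
        System.out.println(findClosest(1.0, centers, 3.0)); // close to the first center
        System.out.println(findClosest(5.0, centers, 3.0)); // outside t1 of both centers
        System.out.println(findClosest(9.0, centers, 3.0)); // close to the second center
    }
}
```

Callers of such a method have to decide what to do with an unassigned point (skip it, log it, or fall back to the unbounded nearest center); that choice is outside the scope of this sketch.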

Thanks and Regards,
Paritosh Ranjan

On 02-10-2011 14:06, Paritosh Ranjan wrote:
> Even run() of CanopyDriver, which takes only T1 and T2 is giving 
> different results for sequential and mapreduce.
> This is preventing me from scaling up, as I need to run mapreduce on 
> hadoop to scale.
>
> Is anyone having any idea of this problem?


Re: Difference in results : Clustering : sequential and MapReduce

Posted by Paritosh Ranjan <pr...@xebia.com>.
Even the run() method of CanopyDriver that takes only T1 and T2 gives
different results for sequential and MapReduce execution.
This is preventing me from scaling up, as I need to run MapReduce on
Hadoop to scale.

Does anyone have any idea what causes this problem?

On 02-10-2011 00:27, Paritosh Ranjan wrote:
> Hi,
>
> I am able to cluster correctly sequentially, using CanopyDriver.
>
> However, the same dataset, when processed as a MapReduce job, where ( 
> t1 = t3 and t2 = t4 and t1>t2) is not working. I am getting errors 
> like Canopies are empty.
>
> I also tried to reduce the values of t3 and t4. But reducing it either 
> has no effect or gives meaningless results.
>
> Am I doing something wrong? or is there a bug somewhere?
>
> I feel that both, sequential and MapReduce should give similar 
> results. But, It is not happening.
>
> Thanks and Regards,
> Paritosh