Posted to user@hadoop.apache.org by Jeff Hubbs <jh...@att.net> on 2019/02/01 12:51:19 UTC

Re: Computation time with Hadoop for kmeans

Jérémy -

Much of the whole point behind Hadoop is that with each worker node 
added to the cluster you widen the disk I/O and CPU-to-RAM pipelines and 
increase the number of cores operating at once. What you've essentially 
done is take a single machine and add on a lot of Hadoop overhead and 
some VM overhead to get in the way.

Another point behind Hadoop is to move the computation to where the data 
is. That idea is pretty much stuffed as far as your rig is concerned 
because all the data and all the computation have nowhere else to move to.

Also, in my experience Hadoop cluster worker nodes aren't useful until 
each has at least 8 GiB of RAM; I can't imagine what you can get out of 
just 1.5 GiB.

I'd advise you to grab some real hardware and try this again.
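
Before involving Hadoop at all, it's worth timing the single-machine 
baseline that any cluster run has to beat. This is a minimal sketch in 
plain R under assumed parameters (the point count, dimensionality, and 
cluster count are illustrative, not Jérémy's exact setup):

```r
# Hypothetical local baseline: time base R kmeans on simulated
# Gaussian data, roughly matching the scale described in the thread.
set.seed(42)
n <- 2e6                                  # 2 million points (assumed)
x <- matrix(rnorm(n * 2), ncol = 2)       # simulated 2-D Gaussian data
t_local <- system.time(
  km <- kmeans(x, centers = 3, iter.max = 20)
)
cat("local kmeans elapsed (s):", t_local["elapsed"], "\n")
```

A Hadoop job on the same box has to beat that elapsed time after also 
paying JVM start-up, HDFS, and serialization costs on every iteration, 
which is why it rarely will.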

On 1/30/19 10:48 AM, Jérémy C wrote:
>
> Good afternoon,
>
>
> I programmed kmeans with Hadoop in R using RHadoop on a cluster of 3 
> machines (3 virtual machines on one physical machine; each VM has one 
> core and 1.5 GB of RAM).
>
> The purpose is to compare computation time between a Hadoop cluster 
> and a local machine (without Hadoop) for kmeans.
>
>
> I simulated data with a Gaussian distribution. With 2 million data 
> points, computation time with Hadoop is still much higher than the 
> time taken without Hadoop. Can computation time with Hadoop be lower 
> than time without Hadoop?
>
> If so, how can I achieve it? As I am working on a single machine with 
> 3 VMs, I am wondering whether it is possible to see the advantages of 
> doing computations with Hadoop.
>
>
> Thank you.
>
> Jeremy
>
>