Posted to user@spark.apache.org by "Valdes, Pablo" <pv...@comscore.com> on 2014/12/01 20:54:25 UTC

Minimum cluster size for empirical testing

Hi everyone,

I’m interested in empirically measuring how much faster Spark is than Hadoop MapReduce for certain problems and the input corpus I currently work with (I’ve read Matei Zaharia’s “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” paper and I want to perform a similar test). I don’t think measuring the speed difference on a single-node cluster is enough, so I was wondering what you would recommend for this task in terms of number of nodes, machine specs, etc.
I was thinking it might be possible to launch a few CDH5 VMs across several machines, or do you think it would be easier to do it on Amazon EC2?
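
If EC2 turns out to be the more convenient route, I understand Spark ships with an ec2/spark-ec2 script that can bring up a small standalone cluster, roughly along these lines (the key pair, slave count, instance type and cluster name below are just placeholders I made up):

    ./ec2/spark-ec2 -k my-keypair -i my-keypair.pem -s 4 --instance-type=m3.large launch spark-bench
    # ... run the comparison jobs, collect timings ...
    ./ec2/spark-ec2 destroy spark-bench

Would a handful of such slaves be representative enough, or does the comparison only become meaningful at a larger scale?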

I’m particularly interested in hearing about your overall experience with this kind of comparison and your recommendations (which other common problems to test, and what kind of benchmarks to use).
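
For concreteness, the kind of iterative workload I had in mind is the logistic regression benchmark from the RDD paper, where caching the working set is what should separate Spark from Hadoop. A rough Scala sketch of what I would time per iteration (the input path, feature dimension and iteration count are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object LRBenchmark {
      case class DataPoint(x: Array[Double], y: Double)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LRBenchmark"))

        val D = 10           // feature dimension (placeholder)
        val iterations = 10  // placeholder

        // Parse "label f1 f2 ... fD" lines; the path is a placeholder.
        val points = sc.textFile("hdfs:///benchmark/points.txt").map { line =>
          val parts = line.split(' ').map(_.toDouble)
          DataPoint(parts.tail, parts.head)
        }.cache()  // keeping the working set in memory is what the paper measures

        val w = Array.fill(D)(0.0)
        for (i <- 1 to iterations) {
          val start = System.nanoTime()
          // Gradient of the logistic loss, summed across the cluster.
          val gradient = points.map { p =>
            val dot = (w, p.x).zipped.map(_ * _).sum
            val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
            p.x.map(_ * scale)
          }.reduce((a, b) => (a, b).zipped.map(_ + _))
          for (j <- 0 until D) w(j) -= gradient(j)
          println(s"Iteration $i took ${(System.nanoTime() - start) / 1e9} s")
        }
        sc.stop()
      }
    }

Does it make sense to compare this against an equivalent Hadoop MapReduce job that re-reads the input on every iteration, or are there better-established benchmark suites for this?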

Have a great start of the week.
Cheers



Pablo Valdes | Software Engineer | comScore, Inc. (NASDAQ:SCOR)

pvaldes@comscore.com



Av. Del Cóndor N° 520, oficina 202, Ciudad Empresarial, Comuna de Huechuraba | Santiago | CL

...........................................................................................................

comScore is a global leader in digital media analytics. We make audiences and advertising more valuable. To learn more, visit www.comscore.com