Posted to dev@gora.apache.org by Furkan KAMACI <fu...@gmail.com> on 2015/10/26 15:36:21 UTC

Benchmark For Apache Gora

Hi All,

I want to prepare a benchmark and a presentation for my Spark backend of
Gora, with the help of Talat. I am planning to follow the benchmarking
approach that the University of California, Berkeley used for Spark [1][2].

Dimensions of my benchmark:

* Hadoop Map/Reduce
* Spark
* Hadoop Map/Reduce via Gora
* Spark via Gora

To that end, I would like to work with two types of dataset:

1) Data-intensive
2) CPU-intensive
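
As a rough sketch only, the four configurations and two dataset types above
form a 4 x 2 run matrix. The Python snippet below simply enumerates it. The
labels are copied from the two lists; the identifiers ENGINES, WORKLOADS and
run_matrix are illustrative names, not part of Gora, Spark or Hadoop.

```python
from itertools import product

# Labels taken from the two lists above; the identifiers are illustrative.
ENGINES = [
    "Hadoop Map/Reduce",
    "Spark",
    "Hadoop Map/Reduce via Gora",
    "Spark via Gora",
]
WORKLOADS = ["Data-intensive", "CPU-intensive"]


def run_matrix():
    """Return every (engine, workload) pair that needs one benchmark run."""
    return list(product(ENGINES, WORKLOADS))


if __name__ == "__main__":
    for engine, workload in run_matrix():
        print(f"{engine:28s} | {workload}")
```

Each pair would then be run on the same cluster and input so that the
"via Gora" rows can be compared directly with their native counterparts.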

First of all, is there any existing benchmark that shows the performance
effect of using Gora for Hadoop/MapReduce?

Secondly, can you suggest any datasets or tools for my purposes (e.g.
Logistic Regression, PageRank, TeraSort [3], Intel-Hadoop Benchmark [4])?


Kind Regards,
Furkan KAMACI

[1] https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[2] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
[3] http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
[4] https://github.com/intel-hadoop/HiBench

Re: Benchmark For Apache Gora

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Sorry Furkan! I did mean you! Please excuse me.
Another interesting resource to look into could be
http://www.pdl.cmu.edu/ycsb++/


Best,

Renato M.


2015-10-30 0:31 GMT+01:00 Furkan KAMACI <fu...@gmail.com>:

> Hi Renato,
>
> I think you meant to address me :) My main purpose is to compare Spark and
> GoraSparkEngine. The Spark papers use K-Means, Logistic Regression,
> Expectation Maximization and Alternating Least Squares to benchmark
> performance against Hadoop Map/Reduce (along with a task that loads a 39 GB
> dump of Wikipedia into memory and runs queries on it), and that's why I want
> to run it on two different datasets.
>
> Kind Regards,
> Furkan KAMACI
>

Re: Benchmark For Apache Gora

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Renato,

I think you meant to address me :) My main purpose is to compare Spark and
GoraSparkEngine. The Spark papers use K-Means, Logistic Regression,
Expectation Maximization and Alternating Least Squares to benchmark
performance against Hadoop Map/Reduce (along with a task that loads a 39 GB
dump of Wikipedia into memory and runs queries on it), and that's why I want
to run it on two different datasets.

Kind Regards,
Furkan KAMACI

On Fri, Oct 30, 2015 at 1:20 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi Talat,
>
> This would be great! This is something that might be really interesting and
> useful for Gora's community!
> I think you could use Spark's native data access to the different available
> stores as a baseline, and then compare it against the GoraRDD integration.
> We have GoraCI, which was designed as a continuous ingestion test to verify
> that Gora doesn't lose data during a distributed job, but it doesn't take
> into account the overhead of actually using Gora as middleware.
> Choosing a CPU-bound algorithm could be an interesting second step, because
> if you start with, for example, an iterative algorithm, the many layers of
> caching might make the benefits/drawbacks of using Gora difficult to observe
> (Spark's internal caching mechanism, the operating system's caching, and
> Gora holding data in memory until it is flushed). What I am trying to say is
> that the results will depend on the algorithm chosen and the type of caching
> it takes advantage of (temporal or spatial locality).
>
>
> Renato M.
>

Re: Benchmark For Apache Gora

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi Talat,

This would be great! This is something that might be really interesting and
useful for Gora's community!
I think you could use Spark's native data access to the different available
stores as a baseline, and then compare it against the GoraRDD integration.
We have GoraCI, which was designed as a continuous ingestion test to verify
that Gora doesn't lose data during a distributed job, but it doesn't take
into account the overhead of actually using Gora as middleware.
Choosing a CPU-bound algorithm could be an interesting second step, because
if you start with, for example, an iterative algorithm, the many layers of
caching might make the benefits/drawbacks of using Gora difficult to observe
(Spark's internal caching mechanism, the operating system's caching, and
Gora holding data in memory until it is flushed). What I am trying to say is
that the results will depend on the algorithm chosen and the type of caching
it takes advantage of (temporal or spatial locality).
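
To make the baseline comparison concrete, here is a minimal sketch in plain
Python of how the Gora overhead could be measured while keeping the first
(cold) run apart from later (warm) runs, so that caching effects stay visible
instead of being averaged away. It does not call Spark or Gora: native_job
and gora_job are hypothetical placeholders for "the same job over native
store access" and "the same job over GoraRDD", and all function names are
assumptions made for illustration.

```python
import statistics
import time


def time_iterations(job, iterations=5):
    """Run `job` repeatedly and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Separate the first (cold) run from the later (warm) runs.

    Needs at least two timings so that there is a warm run to take
    the median of.
    """
    cold, warm = timings[0], timings[1:]
    return {"cold": cold, "warm_median": statistics.median(warm)}


def relative_overhead(baseline_seconds, candidate_seconds):
    """Overhead of the candidate over the baseline; 0.25 means 25% slower."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds


if __name__ == "__main__":
    # Hypothetical stand-ins; replace with the real native and Gora jobs.
    def native_job():
        sum(range(100_000))

    def gora_job():
        sum(range(120_000))

    native = summarize(time_iterations(native_job))
    gora = summarize(time_iterations(gora_job))
    for phase in ("cold", "warm_median"):
        overhead = relative_overhead(native[phase], gora[phase])
        print(f"{phase}: {overhead:+.1%}")
```

Reporting the cold and warm numbers separately is a design choice here: an
iterative algorithm would mostly show up in the warm figure, a single-pass
scan mostly in the cold one.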


Renato M.
