Posted to dev@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2014/10/31 18:38:20 UTC

Surprising Spark SQL benchmark

I know we don't want to be jumping at every benchmark someone posts out
there, but this one surprised me:

http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

This benchmark has Spark SQL failing to complete several queries in the
TPC-H benchmark. I don't understand much about the details of performing
benchmarks, but this was surprising.

Are these results expected?

Related HN discussion here: https://news.ycombinator.com/item?id=8539678

Nick

Re: Surprising Spark SQL benchmark

Posted by Marco Slot <ma...@citusdata.com>.
Hi Patrick,

We left the details of the configuration of Spark that we used out of the
blog post for brevity, but we're happy to share them. We've done quite a
bit of tuning to find the configuration settings that gave us the best
query times and allowed the most queries to complete. I think there might still be a few
improvements that we could make. We spent the majority of our time
optimizing the 20-node (in-memory) case. For documentation purposes, I'm
including a summary of the work we've done so far below. We also look
forward to working with the SparkSQL team on any further optimizations.

Initially, we started off with Spark 1.0.2, so we wrote Java programs to
run the queries. We compared SQLContext to HiveContext and found the latter
to be faster; it was also recommended to us on the mailing list for its
better SQL support (http://goo.gl/IU5Hw0). With Spark 1.1.0, which is the
version that we used to get the benchmark numbers, we just used the
spark-sql command-line tool, which uses HiveContext by default.
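For illustration, running a query through the CLI looks roughly like this (a sketch only; the master URL and query file name are placeholders, not our actual setup):

```
# Spark 1.1.0: the bundled CLI, which uses HiveContext by default
$SPARK_HOME/bin/spark-sql --master spark://master:7077 -f tpch-q1.sql
```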

We pretty quickly ran into the issue that query times were highly variable
(from minutes to hours). If I understand correctly, Spark's MemoryStore
uses a FIFO caching policy, which means it will remove the oldest block
first rather than the least recently used one. At the start, the oldest
data is the actual table data, which will be reused many times. In the
20-node benchmark, we found that the query times became more stable with
cache, because it pins the table data in memory and also uses the optimized
in-memory format. We did not see any difference between text and parquet
tables in the in-memory case after caching. However, we did not use cache
in the 4-node benchmark, because we saw better query times without it, and
used parquet files generated by inserting into a parquet-backed Hive
metastore table that used ParquetHiveSerDe.
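To make the eviction behavior concrete, here is a toy simulation of the two policies (this is not Spark's actual MemoryStore, just FIFO vs. LRU contrasted): table blocks are cached first and re-read by every query, while each query also inserts fresh shuffle blocks. Under FIFO the oldest blocks, i.e., the table data, are evicted first even though they are reused constantly; under LRU they would survive.

```python
from collections import OrderedDict

def simulate(policy, capacity, table_blocks, shuffle_blocks):
    """Toy block store contrasting FIFO and LRU eviction.
    NOT Spark's MemoryStore -- just the two policies side by side."""
    store = OrderedDict()  # head of the dict is the eviction candidate

    def touch(key):
        # A read refreshes recency under LRU; FIFO ignores reads entirely.
        if policy == "lru" and key in store:
            store.move_to_end(key)

    def put(key):
        while len(store) >= capacity:
            store.popitem(last=False)  # evict head: oldest insert (FIFO) / least recent (LRU)
        store[key] = True

    for t in table_blocks:            # table data is cached first...
        put(t)
    for s in shuffle_blocks:          # ...then each query re-reads it and adds shuffle blocks
        for t in table_blocks:
            touch(t)
        put(s)
    return [t for t in table_blocks if t in store]

tables = [f"table-{i}" for i in range(3)]
shuffles = [f"shuffle-{i}" for i in range(5)]
print(simulate("fifo", 5, tables, shuffles))  # → [] : the reused table data was evicted
print(simulate("lru", 5, tables, shuffles))   # → ['table-0', 'table-1', 'table-2']
```

This matches what we saw: without pinning, repeated queries slowly push the hot table data out of the store, which is why query times swung from minutes to hours until we used cache.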

A problem we ran into at the start was that the table data wouldn't fit in
memory when spark.storage.memoryFraction was set to the default value of
0.6, so we increased it to 0.65. We also confirmed that the partitions were
fitting in memory by looking at the Storage tab of the Spark web UI. We
also increased spark.shuffle.memoryFraction from 0.2 to 0.25, which avoided
some premature eviction problems. We ran the benchmark on machines with
122GB of memory and set spark.executor.memory to 110000m to use most of the
available memory. Increasing this value gave more stable query times and
fewer failures.

We set spark.serializer to org.apache.spark.serializer.KryoSerializer as
recommended by the Spark docs, and also because it gave us better query
times. We also set spark.sql.inMemoryColumnarStorage.compressed to true.
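Collected in spark-defaults.conf form, the settings described so far amount to the following (a sketch of the values named above, not our literal config file):

```
spark.storage.memoryFraction                  0.65
spark.shuffle.memoryFraction                  0.25
spark.executor.memory                         110000m
spark.serializer                              org.apache.spark.serializer.KryoSerializer
spark.sql.inMemoryColumnarStorage.compressed  true
```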

TPC-H has some very large intermediate jobs and result sets. We found that
a number of timeouts in Spark trigger too early during queries. We
eventually increased spark.worker.timeout, spark.akka.timeout, and
spark.storage.blockManagerSlaveTimeoutMs
to 10 minutes to avoid these issues.
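Expressed in each property's native unit, 10 minutes comes out as (a sketch; spark.worker.timeout and spark.akka.timeout take seconds, the block manager property milliseconds):

```
spark.worker.timeout                      600
spark.akka.timeout                        600
spark.storage.blockManagerSlaveTimeoutMs  600000
```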

We also tried different block sizes. Some queries ran slightly faster, but
others ran a lot slower, with a bigger block size (e.g., 512MB); we
eventually found that 128MB gave us the most stable and overall lowest
query times.
We also experimented a bit with spark.sql.shuffle.partitions. We eventually
set it to the number of vCPUs in the cluster, which was 320. I should note
that one thing that we've noticed in all of our benchmarks was that when
running TPC-H on EC2, there is not much benefit to using 16 vCPUs over
8. This is because the r3.4xlarge and i2.4xlarge machines have 8 physical
cores, each hyper-threaded to present 16 vCPUs; the benefit of
hyper-threading isn't huge in this case, meaning the vCPUs are (in the
worst case) only half as fast when all 16 are in use.
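In config terms, the shuffle-partition choice is simply the cluster's vCPU count (20 nodes × 16 vCPUs; we assume here that "block size" above refers to the HDFS block size of the input files, which is set separately in hdfs-site.xml):

```
spark.sql.shuffle.partitions  320
```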

There are a few more settings we experimented with, with mixed but
overall not hugely significant results. We tried increasing
spark.shuffle.file.buffer.kb and spark.akka.frameSize, and we increased
spark.akka.threads from 4 to 8.

We tried analyze table <table name> compute statistics, but that failed
with the following errors:

java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

We also tried running analyze table <table name> compute statistics
noscan. However, when using cache table <table name>, it failed with the
following error:

14/11/03 15:59:17 ERROR CliDriver: scala.NotImplementedError: Analyze has
only implemented for Hive tables, but customer is a Subquery
        at
org.apache.spark.sql.hive.HiveContext.analyze(HiveContext.scala:189)

Without cache, the noscan command completes successfully. However, we have
not seen any performance benefit from it and still see queries like Q5
failing (after a very long time). We found using cache preferable to using
analyze in the in-memory case.
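For reference, the statement forms involved are (standard Spark SQL/HiveQL syntax; customer, one of the TPC-H tables from the error above, stands in for any table name):

```
ANALYZE TABLE customer COMPUTE STATISTICS;         -- failed with the "Child Error" above
ANALYZE TABLE customer COMPUTE STATISTICS noscan;  -- completes, but only when the table is not cached
CACHE TABLE customer;                              -- caching first makes analyze fail with the NotImplementedError
```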

Besides these, we also experimented with several other settings but didn't
find them to have a noticeable impact on query times. We now look forward
to working with the SparkSQL developers and re-running the numbers with
proposed optimizations.

regards,
Marco

Re: Surprising Spark SQL benchmark

Posted by Michael Armbrust <mi...@databricks.com>.
dev to bcc.

Thanks for reaching out, Ozgun.  Let's discuss if there were any missing
optimizations off list.  We'll make sure to report back or add any findings
to the tuning guide.

On Mon, Nov 3, 2014 at 3:01 PM, ozgun <oz...@citusdata.com> wrote:

> Hey Patrick,
>
> It's Ozgun from Citus Data. We'd like to make these benchmark results fair,
> and have tried different config settings for SparkSQL over the past month.
> We picked the best config settings we could find, and also contacted the
> Spark users list about running TPC-H numbers.
>
> http://goo.gl/IU5Hw0
> http://goo.gl/WQ1kML
> http://goo.gl/ihLzgh
>
> We also received advice at the Spark Summit '14 to wait until v1.1, and
> therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations,
> Marco and Samay from our team have much more context, and I'll let them
> answer your questions on the different settings we tried.
>
> Our intent is to be fair and not misrepresent SparkSQL's performance. On
> that front, we used publicly available documentation and user lists, and
> spent about a month trying to get the best Spark performance results. If
> there are specific optimizations we should have applied and missed, we'd
> love to be involved with the community in re-running the numbers.
>
> Is this email thread the best place to continue the conversation?
>
> Best,
> Ozgun
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tp9041p9073.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Surprising Spark SQL benchmark

Posted by ozgun <oz...@citusdata.com>.
Hey Patrick,

It's Ozgun from Citus Data. We'd like to make these benchmark results fair,
and have tried different config settings for SparkSQL over the past month.
We picked the best config settings we could find, and also contacted the
Spark users list about running TPC-H numbers.

http://goo.gl/IU5Hw0
http://goo.gl/WQ1kML
http://goo.gl/ihLzgh

We also received advice at the Spark Summit '14 to wait until v1.1, and
therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations,
Marco and Samay from our team have much more context, and I'll let them
answer your questions on the different settings we tried.

Our intent is to be fair and not misrepresent SparkSQL's performance. On
that front, we used publicly available documentation and user lists, and
spent about a month trying to get the best Spark performance results. If
there are specific optimizations we should have applied and missed, we'd
love to be involved with the community in re-running the numbers.

Is this email thread the best place to continue the conversation?

Best,
Ozgun





Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
Good points raised. Some comments.

Re: #1

It seems like there is a misunderstanding of the purpose of the Daytona
Gray benchmark. The purpose of the benchmark is to see how fast you can
sort 100 TB of data (technically, your sort rate during the operation)
using *any* hardware or software config, within the common rules laid out
at http://sortbenchmark.org

Though people will naturally want to compare one benchmarked system to
another, the Gray benchmark does not control the hardware to make such a
comparison useful. So you're right that it's apples to oranges to compare
Databricks's Spark run to Yahoo's Hadoop run in this type of benchmark, but
that's just inherent to the definition of the benchmark. I wouldn't fault
Databricks or Yahoo for this.

That said, it's nice that Databricks went with a public cloud to do this
benchmark, which makes it more likely that future benchmarks done on the
same cloud can be compared meaningfully. The same can't be said of Yahoo's
benchmark, for example, which was done in a private datacenter.

Re: #2

EC2 is a good place to run a reproducible benchmark since it's publicly
accessible and the instance types are well defined. If you had trouble
reproducing the AMPLab benchmark there, I would raise that with the AMPLab
team. I'd assume they would be interested in correcting any problems with
reproducing it, as it definitely detracts from the value of the benchmark.

Nick


On Saturday, November 1, 2014, RJ Nowling <rn...@gmail.com> wrote:

> Two thoughts here:
>
> 1. The real flaw with the sort benchmark was that Hadoop wasn't run on the
> same hardware. Given the advances in networking (availability of 10GbE)
> and disks (SSDs) since the Hadoop benchmarks it was compared
> to, it's an apples to oranges comparison. Without that, it doesn't tell me
> whether the improvement is due to Spark or just hardware.
>
> To me, that's the biggest flaw -- not the reproducibility of it. As you
> say, most people won't have the financial means to access those resources
> to reproduce it.
>
> And that's the same sort of flaw every other marketing benchmark has --
> apples to oranges comparisons.
>
> 2. The BDD benchmark is hard to run outside of EC2, and other users and I
> were not able to access all of the data via S3. I could reproduce some of
> the data using HiBench, but not the web-corpus subsample. As a result, for
> all the hard work put into documenting it, it's still hard to reproduce :(
>
> On Friday, October 31, 2014, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
>
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this
>> is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>>
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>>
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>>
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>>
>> Nick
>>
>>
>> On Friday, October 31, 2014, Steve Nunez <sn...@hortonworks.com> wrote:
>>
>> > To be fair, we (Spark community) haven’t been any better, for example
>> this
>> > benchmark:
>> >
>> >         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>> >
>> >
>> > For which no details or code have been released to allow others to
>> > reproduce it. I would encourage anyone doing a Spark benchmark in future
>> > to avoid the stigma of vendor reported benchmarks and publish enough
>> > information and code to let others repeat the exercise easily.
>> >
>> >         - Steve
>> >
>> >
>> >
>> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>> >
>> > >Thanks for the response, Patrick.
>> > >
>> > >I guess the key takeaways are 1) the tuning/config details are
>> everything
>> > >(they're not laid out here), 2) the benchmark should be reproducible
>> (it's
>> > >not), and 3) reach out to the relevant devs before publishing (didn't
>> > >happen).
>> > >
>> > >Probably key takeaways for any kind of benchmark, really...
>> > >
>> > >Nick
>> > >
>> > >
>> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
>> > >
>> > >> Hey Nick,
>> > >>
>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>> > >> developers when running this. It is really easy to make one system
>> > >> look better than others when you are running a benchmark yourself
>> > >> because tuning and sizing can lead to a 10X performance improvement.
>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>> > >>
>> > >> There are a bunch of things that aren't clear here:
>> > >>
>> > >> 1. Spark SQL has optimized parquet features, were these turned on?
>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>> > >> small tables which can make a 10X difference in TPC-H.
>> > >> 3. For data larger than memory, Spark SQL often performs better if
>> you
>> > >> don't call "cache", did they try this?
>> > >>
>> > >> Basically, a self-reported marketing benchmark like this that
>> > >> *shocker* concludes this vendor's solution is the best, is not
>> > >> particularly useful.
>> > >>
>> > >> If Citus data wants to run a credible benchmark, I'd invite them to
>> > >> directly involve Spark SQL developers in the future. Until then, I
>> > >> wouldn't give much credence to this or any other similar vendor
>> > >> benchmark.
>> > >>
>> > >> - Patrick
>> > >>
>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>> > >> <nicholas.chammas@gmail.com> wrote:
>> > >> > I know we don't want to be jumping at every benchmark someone posts
>> > >>out
>> > >> > there, but this one surprised me:
>> > >> >
>> > >> >
>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> > >> >
>> > >> > This benchmark has Spark SQL failing to complete several queries in
>> > >>the
>> > >> > TPC-H benchmark. I don't understand much about the details of
>> > >>performing
>> > >> > benchmarks, but this was surprising.
>> > >> >
>> > >> > Are these results expected?
>> > >> >
>> > >> > Related HN discussion here:
>> > >>https://news.ycombinator.com/item?id=8539678
>> > >> >
>> > >> > Nick
>> > >>
>> >
>> >
>> >
>> > --
>> > CONFIDENTIALITY NOTICE
>> > NOTICE: This message is intended for the use of the individual or
>> entity to
>> > which it is addressed and may contain information that is confidential,
>> > privileged and exempt from disclosure under applicable law. If the
>> reader
>> > of this message is not the intended recipient, you are hereby notified
>> that
>> > any printing, copying, dissemination, distribution, disclosure or
>> > forwarding of this communication is strictly prohibited. If you have
>> > received this communication in error, please contact the sender
>> immediately
>> > and delete it from your system. Thank You.
>> >
>>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>

Re: Surprising Spark SQL benchmark

Posted by RJ Nowling <rn...@gmail.com>.
Two thoughts here:

1. The real flaw with the sort benchmark was that Hadoop wasn't run on the
same hardware. Given the advances in networking (availability of 10GbE)
and disks (SSDs) since the Hadoop benchmarks it was compared
to, it's an apples to oranges comparison. Without that, it doesn't tell me
whether the improvement is due to Spark or just hardware.

To me, that's the biggest flaw -- not the reproducibility of it. As you
say, most people won't have the financial means to access those resources
to reproduce it.

And that's the same sort of flaw every other marketing benchmark has --
apples to oranges comparisons.

2. The BDD benchmark is hard to run outside of EC2, and other users and I
were not able to access all of the data via S3. I could reproduce some of
the data using HiBench, but not the web-corpus subsample. As a result, for
all the hard work put into documenting it, it's still hard to reproduce :(

On Friday, October 31, 2014, Nicholas Chammas <ni...@gmail.com>
wrote:

> I believe that benchmark has a pending certification on it. See
> http://sortbenchmark.org under "Process".
>
> It's true they did not share enough details on the blog for readers to
> reproduce the benchmark, but they will have to share enough with the
> committee behind the benchmark in order to be certified. Given that this is
> a benchmark not many people will be able to reproduce due to size and
> complexity, I don't see it as a big negative that the details are not laid
> out as long as there is independent certification from a third party.
>
> From what I've seen so far, the best big data benchmark anywhere is this:
> https://amplab.cs.berkeley.edu/benchmark/
>
> It has all the details you'd expect, including hosted datasets, to allow
> anyone to reproduce the full benchmark, covering a number of systems. I
> look forward to the next update to that benchmark (a lot has changed since
> Feb). And from what I can tell, it's produced by the same people behind
> Spark (Patrick being among them).
>
> So I disagree that the Spark community "hasn't been any better" in this
> regard.
>
> Nick
>
>
> On Friday, October 31, 2014, Steve Nunez <snunez@hortonworks.com> wrote:
>
> > To be fair, we (Spark community) haven’t been any better, for example
> this
> > benchmark:
> >
> >         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
> >
> >
> > For which no details or code have been released to allow others to
> > reproduce it. I would encourage anyone doing a Spark benchmark in future
> > to avoid the stigma of vendor reported benchmarks and publish enough
> > information and code to let others repeat the exercise easily.
> >
> >         - Steve
> >
> >
> >
> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
> >
> > >Thanks for the response, Patrick.
> > >
> > >I guess the key takeaways are 1) the tuning/config details are
> everything
> > >(they're not laid out here), 2) the benchmark should be reproducible
> (it's
> > >not), and 3) reach out to the relevant devs before publishing (didn't
> > >happen).
> > >
> > >Probably key takeaways for any kind of benchmark, really...
> > >
> > >Nick
> > >
> > >
> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
> > >
> > >> Hey Nick,
> > >>
> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> > >> developers when running this. It is really easy to make one system
> > >> look better than others when you are running a benchmark yourself
> > >> because tuning and sizing can lead to a 10X performance improvement.
> > >> This benchmark doesn't share the mechanism in a reproducible way.
> > >>
> > >> There are a bunch of things that aren't clear here:
> > >>
> > >> 1. Spark SQL has optimized parquet features, were these turned on?
> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> > >> small tables which can make a 10X difference in TPC-H.
> > >> 3. For data larger than memory, Spark SQL often performs better if you
> > >> don't call "cache", did they try this?
> > >>
> > >> Basically, a self-reported marketing benchmark like this that
> > >> *shocker* concludes this vendor's solution is the best, is not
> > >> particularly useful.
> > >>
> > >> If Citus data wants to run a credible benchmark, I'd invite them to
> > >> directly involve Spark SQL developers in the future. Until then, I
> > >> wouldn't give much credence to this or any other similar vendor
> > >> benchmark.
> > >>
> > >> - Patrick
> > >>
> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
> > >> <nicholas.chammas@gmail.com> wrote:
> > >> > I know we don't want to be jumping at every benchmark someone posts
> > >>out
> > >> > there, but this one surprised me:
> > >> >
> > >> >
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> > >> >
> > >> > This benchmark has Spark SQL failing to complete several queries in
> > >>the
> > >> > TPC-H benchmark. I don't understand much about the details of
> > >>performing
> > >> > benchmarks, but this was surprising.
> > >> >
> > >> > Are these results expected?
> > >> >
> > >> > Related HN discussion here:
> > >>https://news.ycombinator.com/item?id=8539678
> > >> >
> > >> > Nick
> > >>
> >
> >
> >
> >
>


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Surprising Spark SQL benchmark

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
Hi Nick,

No -- we're doing a much more constrained thing of just trying to get
things set up to easily run TPC-DS on SparkSQL (which involves generating
the data, storing it in HDFS, getting all the queries in the right format,
etc.).
Cloudera does have a repo here: https://github.com/cloudera/impala-tpcds-kit
that we've found helpful in running TPC-DS on Hive (you should also be able
to use that repo to run TPC-DS on Impala, although we haven't actually done
this).

-Kay

On Sat, Nov 1, 2014 at 10:50 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Kay,
>
> Is this effort related to the existing AMPLab Big Data benchmark that
> covers Spark, Redshift, Tez, and Impala?
>
> Nick
>
>
> On Friday, October 31, 2014, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
>
> There's been an effort in the AMPLab at Berkeley to set up a shared
>> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
>> we do frequently in the lab to evaluate new research.  Based on this
>> thread, it sounds like making this more widely-available is something that
>> would be useful to folks for reproducing the results published by
>> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
>> list as soon as we're done.
>>
>> -Kay
>>
>> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> I believe that benchmark has a pending certification on it. See
>>> http://sortbenchmark.org under "Process".
>>>
>>> It's true they did not share enough details on the blog for readers to
>>> reproduce the benchmark, but they will have to share enough with the
>>> committee behind the benchmark in order to be certified. Given that this
>>> is
>>> a benchmark not many people will be able to reproduce due to size and
>>> complexity, I don't see it as a big negative that the details are not
>>> laid
>>> out as long as there is independent certification from a third party.
>>>
>>> From what I've seen so far, the best big data benchmark anywhere is this:
>>> https://amplab.cs.berkeley.edu/benchmark/
>>>
>>> It has all the details you'd expect, including hosted datasets, to allow
>>> anyone to reproduce the full benchmark, covering a number of systems. I
>>> look forward to the next update to that benchmark (a lot has changed
>>> since
>>> Feb). And from what I can tell, it's produced by the same people behind
>>> Spark (Patrick being among them).
>>>
>>> So I disagree that the Spark community "hasn't been any better" in this
>>> regard.
>>>
>>> Nick
>>>
>>>
>>> On Friday, October 31, 2014, Steve Nunez <sn...@hortonworks.com> wrote:
>>>
>>> > To be fair, we (Spark community) haven’t been any better, for example
>>> this
>>> > benchmark:
>>> >
>>> >
>>> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>> >
>>> >
>>> > For which no details or code have been released to allow others to
>>> > reproduce it. I would encourage anyone doing a Spark benchmark in
>>> future
>>> > to avoid the stigma of vendor reported benchmarks and publish enough
>>> > information and code to let others repeat the exercise easily.
>>> >
>>> >         - Steve
>>> >
>>> >
>>> >
>>> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>>> >
>>> > >Thanks for the response, Patrick.
>>> > >
>>> > >I guess the key takeaways are 1) the tuning/config details are
>>> everything
>>> > >(they're not laid out here), 2) the benchmark should be reproducible
>>> (it's
>>> > >not), and 3) reach out to the relevant devs before publishing (didn't
>>> > >happen).
>>> > >
>>> > >Probably key takeaways for any kind of benchmark, really...
>>> > >
>>> > >Nick
>>> > >
>>> > >
>>> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
>>> > >
>>> > >> Hey Nick,
>>> > >>
>>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark
>>> SQL
>>> > >> developers when running this. It is really easy to make one system
>>> > >> look better than others when you are running a benchmark yourself
>>> > >> because tuning and sizing can lead to a 10X performance improvement.
>>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>>> > >>
>>> > >> There are a bunch of things that aren't clear here:
>>> > >>
>>> > >> 1. Spark SQL has optimized parquet features, were these turned on?
>>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>> > >> small tables which can make a 10X difference in TPC-H.
>>> > >> 3. For data larger than memory, Spark SQL often performs better if
>>> you
>>> > >> don't call "cache", did they try this?
>>> > >>
>>> > >> Basically, a self-reported marketing benchmark like this that
>>> > >> *shocker* concludes this vendor's solution is the best, is not
>>> > >> particularly useful.
>>> > >>
>>> > >> If Citus data wants to run a credible benchmark, I'd invite them to
>>> > >> directly involve Spark SQL developers in the future. Until then, I
>>> > >> wouldn't give much credence to this or any other similar vendor
>>> > >> benchmark.
>>> > >>
>>> > >> - Patrick
>>> > >>
>>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>> > >> <nicholas.chammas@gmail.com> wrote:
>>> > >> > I know we don't want to be jumping at every benchmark someone
>>> posts
>>> > >>out
>>> > >> > there, but this one surprised me:
>>> > >> >
>>> > >> >
>>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>> > >> >
>>> > >> > This benchmark has Spark SQL failing to complete several queries
>>> in
>>> > >>the
>>> > >> > TPC-H benchmark. I don't understand much about the details of
>>> > >>performing
>>> > >> > benchmarks, but this was surprising.
>>> > >> >
>>> > >> > Are these results expected?
>>> > >> >
>>> > >> > Related HN discussion here:
>>> > >>https://news.ycombinator.com/item?id=8539678
>>> > >> >
>>> > >> > Nick
>>> > >>
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>

Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
Kay,

Is this effort related to the existing AMPLab Big Data benchmark that
covers Spark, Redshift, Tez, and Impala?

Nick


On Friday, October 31, 2014, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
> we do frequently in the lab to evaluate new research.  Based on this
> thread, it sounds like making this more widely-available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
>
> -Kay
>
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
>
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this
>> is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>>
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>>
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>>
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>>
>> Nick
>>
>>
>> On Friday, October 31, 2014, Steve Nunez <snunez@hortonworks.com> wrote:
>>
>> > To be fair, we (Spark community) haven’t been any better, for example
>> this
>> > benchmark:
>> >
>> >         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>> >
>> >
>> > For which no details or code have been released to allow others to
>> > reproduce it. I would encourage anyone doing a Spark benchmark in future
>> > to avoid the stigma of vendor reported benchmarks and publish enough
>> > information and code to let others repeat the exercise easily.
>> >
>> >         - Steve
>> >
>> >
>> >
>> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>> >
>> > >Thanks for the response, Patrick.
>> > >
>> > >I guess the key takeaways are 1) the tuning/config details are
>> everything
>> > >(they're not laid out here), 2) the benchmark should be reproducible
>> (it's
>> > >not), and 3) reach out to the relevant devs before publishing (didn't
>> > >happen).
>> > >
>> > >Probably key takeaways for any kind of benchmark, really...
>> > >
>> > >Nick
>> > >
>> > >
>> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
>> > >
>> > >> Hey Nick,
>> > >>
>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>> > >> developers when running this. It is really easy to make one system
>> > >> look better than others when you are running a benchmark yourself
>> > >> because tuning and sizing can lead to a 10X performance improvement.
>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>> > >>
>> > >> There are a bunch of things that aren't clear here:
>> > >>
>> > >> 1. Spark SQL has optimized parquet features, were these turned on?
>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>> > >> small tables which can make a 10X difference in TPC-H.
>> > >> 3. For data larger than memory, Spark SQL often performs better if
>> you
>> > >> don't call "cache", did they try this?
>> > >>
>> > >> Basically, a self-reported marketing benchmark like this that
>> > >> *shocker* concludes this vendor's solution is the best, is not
>> > >> particularly useful.
>> > >>
>> > >> If Citus data wants to run a credible benchmark, I'd invite them to
>> > >> directly involve Spark SQL developers in the future. Until then, I
>> > >> wouldn't give much credence to this or any other similar vendor
>> > >> benchmark.
>> > >>
>> > >> - Patrick
>> > >>
>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>> > >> <nicholas.chammas@gmail.com> wrote:
>> > >> > I know we don't want to be jumping at every benchmark someone posts
>> > >>out
>> > >> > there, but this one surprised me:
>> > >> >
>> > >> >
>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> > >> >
>> > >> > This benchmark has Spark SQL failing to complete several queries in
>> > >>the
>> > >> > TPC-H benchmark. I don't understand much about the details of
>> > >>performing
>> > >> > benchmarks, but this was surprising.
>> > >> >
>> > >> > Are these results expected?
>> > >> >
>> > >> > Related HN discussion here:
>> > >>https://news.ycombinator.com/item?id=8539678
>> > >> >
>> > >> > Nick
>> > >>
>> >
>> >
>> >
>> > --
>> > CONFIDENTIALITY NOTICE
>> > NOTICE: This message is intended for the use of the individual or
>> entity to
>> > which it is addressed and may contain information that is confidential,
>> > privileged and exempt from disclosure under applicable law. If the
>> reader
>> > of this message is not the intended recipient, you are hereby notified
>> that
>> > any printing, copying, dissemination, distribution, disclosure or
>> > forwarding of this communication is strictly prohibited. If you have
>> > received this communication in error, please contact the sender
>> immediately
>> > and delete it from your system. Thank You.
>> >
>>
>
>
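Concretely, the three checks Patrick lists above translate into statements along these lines in the spark-sql shell. This is a sketch only: the setting names are assumed from the Spark SQL 1.x-era configuration docs, the exact ANALYZE syntax varied by version, and the TPC-H table names (`nation`, `lineitem`) are illustrative.

```sql
-- 1. Parquet features: e.g. choose a compression codec for Spark SQL's
--    native Parquet support (setting name assumed from Spark SQL 1.x docs).
SET spark.sql.parquet.compression.codec=snappy;

-- 2. Compute statistics so the planner knows table sizes; tables below the
--    broadcast threshold (in bytes) can then be broadcast in joins, which
--    matters for TPC-H's small dimension tables such as nation.
ANALYZE TABLE nation COMPUTE STATISTICS noscan;
SET spark.sql.autoBroadcastJoinThreshold=104857600;

-- 3. For data larger than cluster memory, skip caching entirely and let
--    queries scan from disk:
-- CACHE TABLE lineitem;   -- often better omitted in the larger-than-memory case
```

Per Patrick's note, either the statistics/broadcast setting or the caching decision alone can swing TPC-H query times by roughly 10X, which is why leaving them out of a published benchmark configuration matters.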

Re: Surprising Spark SQL benchmark

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi Kay,

Thank you so much for your update!!
I look forward to the shared code from AMPLab. As a member of the Spark community, I really hope I can help run TPC-DS on SparkSQL. At the moment, I am trying the 22 TPC-H queries on SparkSQL 1.1.0 with Hive 0.12 and with Hive 0.13.1, respectively (while waiting for Spark 1.2).

Arthur  

On 1 Nov, 2014, at 3:51 am, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
> we do frequently in the lab to evaluate new research.  Based on this
> thread, it sounds like making this more widely-available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
> 
> -Kay
> 
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
> 
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>> 
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>> 
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>> 
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>> 
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>> 
>> Nick
>> 
>> 
>> On Friday, October 31, 2014, Steve Nunez <sn...@hortonworks.com> wrote:
>> 
>>> To be fair, we (Spark community) haven’t been any better, for example
>> this
>>> benchmark:
>>> 
>>>        https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>> 
>>> 
>>> For which no details or code have been released to allow others to
>>> reproduce it. I would encourage anyone doing a Spark benchmark in future
>>> to avoid the stigma of vendor reported benchmarks and publish enough
>>> information and code to let others repeat the exercise easily.
>>> 
>>>        - Steve
>>> 
>>> 
>>> 
>>> On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>>> 
>>>> Thanks for the response, Patrick.
>>>> 
>>>> I guess the key takeaways are 1) the tuning/config details are
>> everything
>>>> (they're not laid out here), 2) the benchmark should be reproducible
>> (it's
>>>> not), and 3) reach out to the relevant devs before publishing (didn't
>>>> happen).
>>>> 
>>>> Probably key takeaways for any kind of benchmark, really...
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>> On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
>>>> 
>>>>> Hey Nick,
>>>>> 
>>>>> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>>>>> developers when running this. It is really easy to make one system
>>>>> look better than others when you are running a benchmark yourself
>>>>> because tuning and sizing can lead to a 10X performance improvement.
>>>>> This benchmark doesn't share the mechanism in a reproducible way.
>>>>> 
>>>>> There are a bunch of things that aren't clear here:
>>>>> 
>>>>> 1. Spark SQL has optimized parquet features, were these turned on?
>>>>> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>>>> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>>>> small tables which can make a 10X difference in TPC-H.
>>>>> 3. For data larger than memory, Spark SQL often performs better if you
>>>>> don't call "cache", did they try this?
>>>>> 
>>>>> Basically, a self-reported marketing benchmark like this that
>>>>> *shocker* concludes this vendor's solution is the best, is not
>>>>> particularly useful.
>>>>> 
>>>>> If Citus data wants to run a credible benchmark, I'd invite them to
>>>>> directly involve Spark SQL developers in the future. Until then, I
>>>>> wouldn't give much credence to this or any other similar vendor
>>>>> benchmark.
>>>>> 
>>>>> - Patrick
>>>>> 
>>>>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>>>> <nicholas.chammas@gmail.com> wrote:
>>>>>> I know we don't want to be jumping at every benchmark someone posts
>>>>> out
>>>>>> there, but this one surprised me:
>>>>>> 
>>>>>> 
>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>>>>> 
>>>>>> This benchmark has Spark SQL failing to complete several queries in
>>>>> the
>>>>>> TPC-H benchmark. I don't understand much about the details of
>>>>> performing
>>>>>> benchmarks, but this was surprising.
>>>>>> 
>>>>>> Are these results expected?
>>>>>> 
>>>>>> Related HN discussion here:
>>>>> https://news.ycombinator.com/item?id=8539678
>>>>>> 
>>>>>> Nick
>>>>> 
>>> 
>>> 
>>> 
>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Surprising Spark SQL benchmark

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
There's been an effort in the AMPLab at Berkeley to set up a shared
codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
we do frequently in the lab to evaluate new research.  Based on this
thread, it sounds like making this more widely-available is something that
would be useful to folks for reproducing the results published by
Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
list as soon as we're done.

-Kay

On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> I believe that benchmark has a pending certification on it. See
> http://sortbenchmark.org under "Process".
>
> It's true they did not share enough details on the blog for readers to
> reproduce the benchmark, but they will have to share enough with the
> committee behind the benchmark in order to be certified. Given that this is
> a benchmark not many people will be able to reproduce due to size and
> complexity, I don't see it as a big negative that the details are not laid
> out as long as there is independent certification from a third party.
>
> From what I've seen so far, the best big data benchmark anywhere is this:
> https://amplab.cs.berkeley.edu/benchmark/
>
> It has all the details you'd expect, including hosted datasets, to allow
> anyone to reproduce the full benchmark, covering a number of systems. I
> look forward to the next update to that benchmark (a lot has changed since
> Feb). And from what I can tell, it's produced by the same people behind
> Spark (Patrick being among them).
>
> So I disagree that the Spark community "hasn't been any better" in this
> regard.
>
> Nick
>
>
> On Friday, October 31, 2014, Steve Nunez <sn...@hortonworks.com> wrote:
>
> > To be fair, we (Spark community) haven’t been any better, for example
> this
> > benchmark:
> >
> >         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
> >
> >
> > For which no details or code have been released to allow others to
> > reproduce it. I would encourage anyone doing a Spark benchmark in future
> > to avoid the stigma of vendor reported benchmarks and publish enough
> > information and code to let others repeat the exercise easily.
> >
> >         - Steve
> >
> >
> >
> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
> >
> > >Thanks for the response, Patrick.
> > >
> > >I guess the key takeaways are 1) the tuning/config details are
> everything
> > >(they're not laid out here), 2) the benchmark should be reproducible
> (it's
> > >not), and 3) reach out to the relevant devs before publishing (didn't
> > >happen).
> > >
> > >Probably key takeaways for any kind of benchmark, really...
> > >
> > >Nick
> > >
> > >
> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
> > >
> > >> Hey Nick,
> > >>
> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> > >> developers when running this. It is really easy to make one system
> > >> look better than others when you are running a benchmark yourself
> > >> because tuning and sizing can lead to a 10X performance improvement.
> > >> This benchmark doesn't share the mechanism in a reproducible way.
> > >>
> > >> There are a bunch of things that aren't clear here:
> > >>
> > >> 1. Spark SQL has optimized parquet features, were these turned on?
> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> > >> small tables which can make a 10X difference in TPC-H.
> > >> 3. For data larger than memory, Spark SQL often performs better if you
> > >> don't call "cache", did they try this?
> > >>
> > >> Basically, a self-reported marketing benchmark like this that
> > >> *shocker* concludes this vendor's solution is the best, is not
> > >> particularly useful.
> > >>
> > >> If Citus data wants to run a credible benchmark, I'd invite them to
> > >> directly involve Spark SQL developers in the future. Until then, I
> > >> wouldn't give much credence to this or any other similar vendor
> > >> benchmark.
> > >>
> > >> - Patrick
> > >>
> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
> > >> <nicholas.chammas@gmail.com> wrote:
> > >> > I know we don't want to be jumping at every benchmark someone posts
> > >>out
> > >> > there, but this one surprised me:
> > >> >
> > >> >
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> > >> >
> > >> > This benchmark has Spark SQL failing to complete several queries in
> > >>the
> > >> > TPC-H benchmark. I don't understand much about the details of
> > >>performing
> > >> > benchmarks, but this was surprising.
> > >> >
> > >> > Are these results expected?
> > >> >
> > >> > Related HN discussion here:
> > >>https://news.ycombinator.com/item?id=8539678
> > >> >
> > >> > Nick
> > >>
> >
> >
> >
> >
>

Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
Steve,

Your original comment was about the *reproducibility* of the benchmark,
which I was responding to. No one is suggesting you doubt the authenticity
or results of the benchmark.

For which no details or code have been released to allow others to
> reproduce it. I would encourage anyone doing a Spark benchmark in future

to avoid the stigma of vendor reported benchmarks and publish enough

information and code to let others repeat the exercise easily.


So to reiterate, the results and paper that Databricks published should let
other people reproduce their submission to the Daytona Gray benchmark. This
addresses your original concern quoted above.

Nick


On Wed, Nov 5, 2014 at 7:10 PM, Reynold Xin <rx...@databricks.com> wrote:

> Steve,
>
> I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
> updated the blog post to actually include CPU / disk / network measures.
> You should see that in any measure that matters to this benchmark, the old
> 2100 node cluster is vastly superior. The data even fit in memory!
>
>
>
> On Wed, Nov 5, 2014 at 4:07 PM, Steve Nunez <sn...@hortonworks.com>
> wrote:
>
>> Nicholas,
>>
>> I never doubted the authenticity of the benchmark, nor the results. What I
>> think could be better is an objective analysis of the results. That post
>> neglected to point out the significant differences in hardware those two
>> benchmarks were run on. It is a bit like bragging you broke the world record
>> at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
>> mention that the last record was held by a 2001 Toyota Celica.
>>
>> - Steve
>>
>>
>> From:  Nicholas Chammas <ni...@gmail.com>
>> Date:  Wednesday, November 5, 2014 at 15:56
>> To:  Steve Nunez <sn...@hortonworks.com>
>> Cc:  Patrick Wendell <pw...@gmail.com>, dev <de...@spark.apache.org>
>> Subject:  Re: Surprising Spark SQL benchmark
>>
>> > Steve Nunez, I believe the information behind the links below should
>> address
>> > your concerns earlier about Databricks's submission to the Daytona Gray
>> > benchmark.
>> >
>> > On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com>
>> > wrote:
>> >> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
>> >> <ni...@gmail.com> wrote:
>> >>
>> >>> I believe that benchmark has a pending certification on it. See
>> >>> http://sortbenchmark.org under "Process".
>> >> Regarding this comment, Reynold has just announced that this benchmark
>> is now
>> >> certified.
>> >> * Announcement:
>> >> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>> >> * Updated benchmark results page: http://sortbenchmark.org/
>> >> * Paper detailing Spark cluster configuration for the benchmark:
>> >> http://sortbenchmark.org/ApacheSpark2014.pdf
>> >> Nick
>> >>
>> >> ​
>> >
>>
>>
>>
>>
>
>

Re: Surprising Spark SQL benchmark

Posted by Matei Zaharia <ma...@gmail.com>.
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in memory on this cluster, which can make shuffle much faster than with intermediate data on SSDs. You can find the specs in http://sortbenchmark.org/Yahoo2013Sort.pdf. It just takes effort to utilize modern machines fully -- for instance the Yahoo! cluster had 1 TB/s network bandwidth, but only sorted data at 0.02 TB/s. Systems optimized for sorting, like TritonSort (which also won this year's benchmark), get much closer to full utilization.

Matei
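The utilization point works out from the figures in this thread alone (a back-of-envelope sketch using the 2100-node count and 64 GB/node spec mentioned above, not taken from the paper):

```latex
\text{Aggregate RAM: } 2100 \text{ nodes} \times 64\ \mathrm{GB} \approx 134\ \mathrm{TB} > 100\ \mathrm{TB} \text{ of input,}
\quad\text{so intermediate data fits in memory.}

\text{Network utilization: } \frac{0.02\ \mathrm{TB/s}\ \text{(achieved sort rate)}}{1\ \mathrm{TB/s}\ \text{(aggregate bandwidth)}} = 2\%.
```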

> On Nov 5, 2014, at 4:10 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> Steve,
> 
> I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
> updated the blog post to actually include CPU / disk / network measures.
> You should see that in any measure that matters to this benchmark, the old
> 2100 node cluster is vastly superior. The data even fit in memory!
> 
> 
> 
> On Wed, Nov 5, 2014 at 4:07 PM, Steve Nunez <sn...@hortonworks.com> wrote:
> 
>> Nicholas,
>> 
>> I never doubted the authenticity of the benchmark, nor the results. What I
>> think could be better is an objective analysis of the results. That post
>> neglected to point out the significant differences in hardware those two
>> benchmarks were run on. It is a bit like bragging you broke the world record
>> at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
>> mention that the last record was held by a 2001 Toyota Celica.
>> 
>> - Steve
>> 
>> 
>> From:  Nicholas Chammas <ni...@gmail.com>
>> Date:  Wednesday, November 5, 2014 at 15:56
>> To:  Steve Nunez <sn...@hortonworks.com>
>> Cc:  Patrick Wendell <pw...@gmail.com>, dev <de...@spark.apache.org>
>> Subject:  Re: Surprising Spark SQL benchmark
>> 
>>> Steve Nunez, I believe the information behind the links below should
>> address
>>> your concerns earlier about Databricks's submission to the Daytona Gray
>>> benchmark.
>>> 
>>> On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com>
>>> wrote:
>>>> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
>>>> <ni...@gmail.com> wrote:
>>>> 
>>>>> I believe that benchmark has a pending certification on it. See
>>>>> http://sortbenchmark.org under "Process".
>>>> Regarding this comment, Reynold has just announced that this benchmark
>> is now
>>>> certified.
>>>> * Announcement:
>>>> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>>>> * Updated benchmark results page: http://sortbenchmark.org/
>>>> * Paper detailing Spark cluster configuration for the benchmark:
>>>> http://sortbenchmark.org/ApacheSpark2014.pdf
>>>> Nick
>>>> 
>>>> ​
>>> 
>> 
>> 
>> 
>> 




Re: Surprising Spark SQL benchmark

Posted by Reynold Xin <rx...@databricks.com>.
Steve,

I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
updated the blog post to actually include CPU / disk / network measures.
You should see that in any measure that matters to this benchmark, the old
2100 node cluster is vastly superior. The data even fit in memory!



On Wed, Nov 5, 2014 at 4:07 PM, Steve Nunez <sn...@hortonworks.com> wrote:

> Nicholas,
>
> I never doubted the authenticity of the benchmark, nor the results. What I
> think could be better is an objective analysis of the results. That post
> neglected to point out the significant differences in hardware those two
> benchmarks were run on. It is a bit like bragging you broke the world record
> at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
> mention that the last record was held by a 2001 Toyota Celica.
>
> - Steve
>
>
> From:  Nicholas Chammas <ni...@gmail.com>
> Date:  Wednesday, November 5, 2014 at 15:56
> To:  Steve Nunez <sn...@hortonworks.com>
> Cc:  Patrick Wendell <pw...@gmail.com>, dev <de...@spark.apache.org>
> Subject:  Re: Surprising Spark SQL benchmark
>
> > Steve Nunez, I believe the information behind the links below should
> address
> > your concerns earlier about Databricks's submission to the Daytona Gray
> > benchmark.
> >
> > On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com>
> > wrote:
> >> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
> >> <ni...@gmail.com> wrote:
> >>
> >>> I believe that benchmark has a pending certification on it. See
> >>> http://sortbenchmark.org under "Process".
> >> Regarding this comment, Reynold has just announced that this benchmark
> is now
> >> certified.
> >> * Announcement:
> >> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
> >> * Updated benchmark results page: http://sortbenchmark.org/
> >> * Paper detailing Spark cluster configuration for the benchmark:
> >> http://sortbenchmark.org/ApacheSpark2014.pdf
> >> Nick
> >>
> >> ​
> >
>
>
>
>

Re: Surprising Spark SQL benchmark

Posted by Steve Nunez <sn...@hortonworks.com>.
Nicholas,

I never doubted the authenticity of the benchmark, nor the results. What I
think could be better is an objective analysis of the results. That post
neglected to point out the significant differences in hardware those two
benchmarks were run on. It is a bit like bragging you broke the world record
at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
mention that the last record was held by a 2001 Toyota Celica.

- Steve


From:  Nicholas Chammas <ni...@gmail.com>
Date:  Wednesday, November 5, 2014 at 15:56
To:  Steve Nunez <sn...@hortonworks.com>
Cc:  Patrick Wendell <pw...@gmail.com>, dev <de...@spark.apache.org>
Subject:  Re: Surprising Spark SQL benchmark

> Steve Nunez, I believe the information behind the links below should address
> your concerns earlier about Databricks's submission to the Daytona Gray
> benchmark.
> 
> On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <ni...@gmail.com>
> wrote:
>> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
>> <ni...@gmail.com> wrote:
>> 
>>> I believe that benchmark has a pending certification on it. See
>>> http://sortbenchmark.org under "Process".
>> Regarding this comment, Reynold has just announced that this benchmark is now
>> certified.
>> * Announcement: 
>> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>> * Updated benchmark results page: http://sortbenchmark.org/
>> * Paper detailing Spark cluster configuration for the benchmark:
>> http://sortbenchmark.org/ApacheSpark2014.pdf
>> Nick
>> 
>> ​
> 




Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
Steve Nunez, I believe the information behind the links below should
address your concerns earlier about Databricks's submission to the Daytona
Gray benchmark.

On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas <nicholas.chammas@gmail.com
> wrote:

> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
> Regarding this comment, Reynold has just announced that this benchmark is
> now certified.
>
>    - Announcement:
>    http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>    - Updated benchmark results page: http://sortbenchmark.org/
>    - Paper detailing Spark cluster configuration for the benchmark:
>    http://sortbenchmark.org/ApacheSpark2014.pdf
>
> Nick
> ​
>

Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

I believe that benchmark has a pending certification on it. See
> http://sortbenchmark.org under "Process".
>
Regarding this comment, Reynold has just announced that this benchmark is
now certified.

   - Announcement:
   http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
   - Updated benchmark results page: http://sortbenchmark.org/
   - Paper detailing Spark cluster configuration for the benchmark:
   http://sortbenchmark.org/ApacheSpark2014.pdf

Nick
​

Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
I believe that benchmark has a pending certification on it. See
http://sortbenchmark.org under "Process".

It's true they did not share enough details on the blog for readers to
reproduce the benchmark, but they will have to share enough with the
committee behind the benchmark in order to be certified. Given that this is
a benchmark not many people will be able to reproduce due to size and
complexity, I don't see it as a big negative that the details are not laid
out as long as there is independent certification from a third party.

From what I've seen so far, the best big data benchmark anywhere is this:
https://amplab.cs.berkeley.edu/benchmark/

It has all the details you'd expect, including hosted datasets, to allow
anyone to reproduce the full benchmark, covering a number of systems. I
look forward to the next update to that benchmark (a lot has changed since
Feb). And from what I can tell, it's produced by the same people behind
Spark (Patrick being among them).

So I disagree that the Spark community "hasn't been any better" in this
regard.

Nick


On Friday, October 31, 2014, Steve Nunez <sn...@hortonworks.com> wrote:

> To be fair, we (Spark community) haven’t been any better, for example this
> benchmark:
>
>         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>
>
> For which no details or code have been released to allow others to
> reproduce it. I would encourage anyone doing a Spark benchmark in future
> to avoid the stigma of vendor reported benchmarks and publish enough
> information and code to let others repeat the exercise easily.
>
>         - Steve
>
>
>
> On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>
> >Thanks for the response, Patrick.
> >
> >I guess the key takeaways are 1) the tuning/config details are everything
> >(they're not laid out here), 2) the benchmark should be reproducible (it's
> >not), and 3) reach out to the relevant devs before publishing (didn't
> >happen).
> >
> >Probably key takeaways for any kind of benchmark, really...
> >
> >Nick
> >
> >
> >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
> >
> >> Hey Nick,
> >>
> >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> >> developers when running this. It is really easy to make one system
> >> look better than others when you are running a benchmark yourself
> >> because tuning and sizing can lead to a 10X performance improvement.
> >> This benchmark doesn't share the mechanism in a reproducible way.
> >>
> >> There are a bunch of things that aren't clear here:
> >>
> >> 1. Spark SQL has optimized parquet features, were these turned on?
> >> 2. It doesn't mention computing statistics in Spark SQL, but it does
> >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> >> small tables which can make a 10X difference in TPC-H.
> >> 3. For data larger than memory, Spark SQL often performs better if you
> >> don't call "cache", did they try this?
> >>
> >> Basically, a self-reported marketing benchmark like this that
> >> *shocker* concludes this vendor's solution is the best, is not
> >> particularly useful.
> >>
> >> If Citus Data wants to run a credible benchmark, I'd invite them to
> >> directly involve Spark SQL developers in the future. Until then, I
> >> wouldn't give much credence to this or any other similar vendor
> >> benchmark.
> >>
> >> - Patrick
> >>
> >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
> >> <nicholas.chammas@gmail.com> wrote:
> >> > I know we don't want to be jumping at every benchmark someone posts
> >>out
> >> > there, but this one surprised me:
> >> >
> >> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> >> >
> >> > This benchmark has Spark SQL failing to complete several queries in
> >>the
> >> > TPC-H benchmark. I don't understand much about the details of
> >>performing
> >> > benchmarks, but this was surprising.
> >> >
> >> > Are these results expected?
> >> >
> >> > Related HN discussion here:
> >>https://news.ycombinator.com/item?id=8539678
> >> >
> >> > Nick
> >>
>
>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: Surprising Spark SQL benchmark

Posted by Steve Nunez <sn...@hortonworks.com>.
To be fair, we (Spark community) haven’t been any better, for example this
benchmark:

	https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html


For which no details or code have been released to allow others to
reproduce it. I would encourage anyone doing a Spark benchmark in future
to avoid the stigma of vendor reported benchmarks and publish enough
information and code to let others repeat the exercise easily.

	- Steve



On 10/31/14, 11:30, "Nicholas Chammas" <ni...@gmail.com> wrote:

>Thanks for the response, Patrick.
>
>I guess the key takeaways are 1) the tuning/config details are everything
>(they're not laid out here), 2) the benchmark should be reproducible (it's
>not), and 3) reach out to the relevant devs before publishing (didn't
>happen).
>
>Probably key takeaways for any kind of benchmark, really...
>
>Nick
>
>
>On Friday, October 31, 2014, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey Nick,
>>
>> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>> developers when running this. It is really easy to make one system
>> look better than others when you are running a benchmark yourself
>> because tuning and sizing can lead to a 10X performance improvement.
>> This benchmark doesn't share the mechanism in a reproducible way.
>>
>> There are a bunch of things that aren't clear here:
>>
>> 1. Spark SQL has optimized parquet features, were these turned on?
>> 2. It doesn't mention computing statistics in Spark SQL, but it does
>> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>> small tables which can make a 10X difference in TPC-H.
>> 3. For data larger than memory, Spark SQL often performs better if you
>> don't call "cache", did they try this?
>>
>> Basically, a self-reported marketing benchmark like this that
>> *shocker* concludes this vendor's solution is the best, is not
>> particularly useful.
>>
>> If Citus Data wants to run a credible benchmark, I'd invite them to
>> directly involve Spark SQL developers in the future. Until then, I
>> wouldn't give much credence to this or any other similar vendor
>> benchmark.
>>
>> - Patrick
>>
>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>> <nicholas.chammas@gmail.com> wrote:
>> > I know we don't want to be jumping at every benchmark someone posts
>>out
>> > there, but this one surprised me:
>> >
>> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> >
>> > This benchmark has Spark SQL failing to complete several queries in
>>the
>> > TPC-H benchmark. I don't understand much about the details of
>>performing
>> > benchmarks, but this was surprising.
>> >
>> > Are these results expected?
>> >
>> > Related HN discussion here:
>>https://news.ycombinator.com/item?id=8539678
>> >
>> > Nick
>>




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Surprising Spark SQL benchmark

Posted by Nicholas Chammas <ni...@gmail.com>.
Thanks for the response, Patrick.

I guess the key takeaways are 1) the tuning/config details are everything
(they're not laid out here), 2) the benchmark should be reproducible (it's
not), and 3) reach out to the relevant devs before publishing (didn't
happen).

Probably key takeaways for any kind of benchmark, really...

Nick


On Friday, October 31, 2014, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Nick,
>
> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> developers when running this. It is really easy to make one system
> look better than others when you are running a benchmark yourself
> because tuning and sizing can lead to a 10X performance improvement.
> This benchmark doesn't share the mechanism in a reproducible way.
>
> There are a bunch of things that aren't clear here:
>
> 1. Spark SQL has optimized parquet features, were these turned on?
> 2. It doesn't mention computing statistics in Spark SQL, but it does
> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> small tables which can make a 10X difference in TPC-H.
> 3. For data larger than memory, Spark SQL often performs better if you
> don't call "cache", did they try this?
>
> Basically, a self-reported marketing benchmark like this that
> *shocker* concludes this vendor's solution is the best, is not
> particularly useful.
>
> If Citus Data wants to run a credible benchmark, I'd invite them to
> directly involve Spark SQL developers in the future. Until then, I
> wouldn't give much credence to this or any other similar vendor
> benchmark.
>
> - Patrick
>
> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
> <nicholas.chammas@gmail.com> wrote:
> > I know we don't want to be jumping at every benchmark someone posts out
> > there, but this one surprised me:
> >
> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> >
> > This benchmark has Spark SQL failing to complete several queries in the
> > TPC-H benchmark. I don't understand much about the details of performing
> > benchmarks, but this was surprising.
> >
> > Are these results expected?
> >
> > Related HN discussion here: https://news.ycombinator.com/item?id=8539678
> >
> > Nick
>

Re: Surprising Spark SQL benchmark

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Nick,

Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
developers when running this. It is really easy to make one system
look better than others when you are running a benchmark yourself
because tuning and sizing can lead to a 10X performance improvement.
This benchmark doesn't share the mechanism in a reproducible way.

There are a bunch of things that aren't clear here:

1. Spark SQL has optimized parquet features, were these turned on?
2. It doesn't mention computing statistics in Spark SQL, but it does
this for Impala and Parquet. Statistics allow Spark SQL to broadcast
small tables which can make a 10X difference in TPC-H.
3. For data larger than memory, Spark SQL often performs better if you
don't call "cache", did they try this?
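
For reference, here is roughly what those three knobs look like in the
spark-sql shell. This is a sketch rather than a tested recipe: the exact
property names and the TPC-H table names (nation, region, lineitem) are
illustrative, so double-check them against the Spark 1.1 docs before
relying on them.

```sql
-- (1) Columnar/Parquet optimizations: compress the in-memory columnar
--     cache and pick a Parquet codec for written tables.
SET spark.sql.inMemoryColumnarStorage.compressed=true;
SET spark.sql.parquet.compression.codec=snappy;

-- (2) Compute table-level statistics so the planner can broadcast small
--     dimension tables (available through HiveContext / spark-sql):
ANALYZE TABLE nation COMPUTE STATISTICS noscan;
ANALYZE TABLE region COMPUTE STATISTICS noscan;
-- Tables at or below this size in bytes are broadcast to all workers:
SET spark.sql.autoBroadcastJoinThreshold=10485760;

-- (3) For data larger than cluster memory, simply skip
--     `CACHE TABLE lineitem;` and let queries scan from disk rather
--     than thrash the in-memory store.
```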

Basically, a self-reported marketing benchmark like this that
*shocker* concludes this vendor's solution is the best, is not
particularly useful.

If Citus Data wants to run a credible benchmark, I'd invite them to
directly involve Spark SQL developers in the future. Until then, I
wouldn't give much credence to this or any other similar vendor
benchmark.

- Patrick

On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
<ni...@gmail.com> wrote:
> I know we don't want to be jumping at every benchmark someone posts out
> there, but this one surprised me:
>
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
> This benchmark has Spark SQL failing to complete several queries in the
> TPC-H benchmark. I don't understand much about the details of performing
> benchmarks, but this was surprising.
>
> Are these results expected?
>
> Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>
> Nick
