Posted to user@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2014/11/06 00:11:42 UTC

Re: Breaking the previous large-scale sort record with Spark

Hi all,

We are excited to announce that the benchmark entry has been reviewed by
the Sort Benchmark committee and Spark has officially won the Daytona
GraySort contest for sorting 100 TB of data.

Our entry tied with a UCSD research team building high-performance systems,
and we jointly set a new world record. This is an important milestone for
the project, as it validates the amount of engineering work put into Spark
by the community.

As Matei said, "For an engine to scale from these multi-hour petabyte batch
jobs down to 100-millisecond streaming and interactive queries is quite
uncommon, and it's thanks to all of you folks that we are able to make this
happen."

Updated blog post:
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
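
For anyone curious how a distributed sort is expressed at the API level, here
is a minimal sketch using Spark's core RDD API. To be clear, this is not the
benchmark entry (the record run leaned on lower-level engine work such as the
sort-based shuffle); the object name, input path, key extraction, and
partition count below are hypothetical and purely illustrative.

    // Illustrative sketch only: globally sort records by key with the RDD API.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)

    object SortSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))

        // Hypothetical input location and partition count.
        val records = sc.textFile("hdfs:///data/records", 1000)

        records
          .map(line => (line.split('\t')(0), line)) // key on the first field
          .sortByKey(ascending = true)              // cluster-wide sort by key
          .values                                   // drop the key again
          .saveAsTextFile("hdfs:///data/records-sorted")

        sc.stop()
      }
    }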




On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia <ma...@gmail.com>
wrote:

> Hi folks,
>
> I interrupt your regularly scheduled user / dev list to bring you some
> pretty cool news for the project, which is that we've been able to use
> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
> faster on 10x fewer nodes. There's a detailed writeup at
> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
> Summary: while Hadoop MapReduce held last year's 100 TB world record by
> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>
> I want to thank Reynold Xin for leading this effort over the past few
> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
> providing the machines to make this possible. Finally, this result would of
> course not be possible without the many many other contributions, testing
> and feature requests from throughout the community.
>
> For an engine to scale from these multi-hour petabyte batch jobs down to
> 100-millisecond streaming and interactive queries is quite uncommon, and
> it's thanks to all of you folks that we are able to make this happen.
>
> Matei
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
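
A back-of-the-envelope reading of the figures quoted above: 72 minutes on
2100 nodes is about 151,000 node-minutes for 100 TB, or roughly 0.66 GB per
node per minute, while 23 minutes on 206 nodes is about 4,700 node-minutes,
or roughly 21 GB per node per minute. That is where the "3x faster on 10x
fewer nodes" summary comes from, and it works out to about 30x the per-node
throughput.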

Re: Breaking the previous large-scale sort record with Spark

Posted by Matei Zaharia <ma...@gmail.com>.
Congrats to everyone who helped make this happen. And if anyone has even more machines they'd like us to run on next year, let us know :).

Matei

> On Nov 5, 2014, at 3:11 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> Hi all,
> 
> We are excited to announce that the benchmark entry has been reviewed by
> the Sort Benchmark committee and Spark has officially won the Daytona
> GraySort contest in sorting 100TB of data.
> 
> Our entry tied with a UCSD research team building high performance systems
> and we jointly set a new world record. This is an important milestone for
> the project, as it validates the amount of engineering work put into Spark
> by the community.
> 
> As Matei said, "For an engine to scale from these multi-hour petabyte batch
> jobs down to 100-millisecond streaming and interactive queries is quite
> uncommon, and it's thanks to all of you folks that we are able to make this
> happen."
> 
> Updated blog post:
> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
> 
> 
> 
> 
> On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia <ma...@gmail.com>
> wrote:
> 
>> Hi folks,
>> 
>> I interrupt your regularly scheduled user / dev list to bring you some
>> pretty cool news for the project, which is that we've been able to use
>> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
>> faster on 10x fewer nodes. There's a detailed writeup at
>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
>> Summary: while Hadoop MapReduce held last year's 100 TB world record by
>> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
>> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>> 
>> I want to thank Reynold Xin for leading this effort over the past few
>> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
>> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
>> providing the machines to make this possible. Finally, this result would of
>> course not be possible without the many many other contributions, testing
>> and feature requests from throughout the community.
>> 
>> For an engine to scale from these multi-hour petabyte batch jobs down to
>> 100-millisecond streaming and interactive queries is quite uncommon, and
>> it's thanks to all of you folks that we are able to make this happen.
>> 
>> Matei
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

