Posted to user@spark.apache.org by JayeshLalwani <Ja...@capitalone.com> on 2017/05/03 22:37:15 UTC

Re: Spark-SQL collect function

In any distributed application, you scale up by splitting execution up on
multiple machines. The way Spark does this is by slicing the data into
partitions and spreading them on multiple machines. Logically, an RDD is
exactly that: data is split up and spread around on multiple machines. When
you perform operations on an RDD, Spark tells all the machines to perform
that operation on their own slice of data. So, for example, if you perform a
filter operation (or, if you are using SQL, you run /SELECT * FROM tablename
WHERE col = colval/), Spark tells each machine to look for rows that match your
filter criteria in its own slice of data. This operation results in
another distributed dataset that contains the filtered records. Note that
when you do a filter operation, Spark doesn't move data outside of the
machines that they reside in. It keeps the filtered records in the same
machine. This ability of Spark to keep data in place is what provides
scalability. As long as your operations keep data in place, you can scale up
infinitely. If you get 10x more records, you can add 10x more machines, and
you will get the same performance.
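
To make that concrete, here is a minimal sketch in Scala (the table name,
column, and input path are invented for illustration). Both forms below are
narrow operations: each machine filters its own partitions, and nothing
crosses the network.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("narrow-filter").getOrCreate()

// Hypothetical input; Spark slices it into partitions across the cluster.
val df = spark.read.parquet("/data/tablename")
df.createOrReplaceTempView("tablename")

// Two equivalent narrow operations: each executor scans only its own
// slice, and the matching rows stay on the machine where they live.
val viaApi = df.filter(df("col") === "colval")
val viaSql = spark.sql("SELECT * FROM tablename WHERE col = 'colval'")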

However, the problem is that a lot of operations cannot be done by keeping
data in place. For example, let's say you have 2 tables/dataframes. Spark
will slice both up and spread them around the machines. Now let's say, you
joined both tables. It may happen that the slice of data that resides in one
machine has matching records in another machine. So, now, Spark has to bring
data over from one machine to another. This is what Spark calls a
/shuffle/. Spark does this intelligently, but whenever data leaves one
machine and goes to another, you cannot scale infinitely. There will
be a point at which you overwhelm the network, and adding more machines
isn't going to improve performance.
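
A sketch of that, reusing the `spark` session from the snippet above (the
table names and join key are invented): joining two DataFrames on a key
forces matching rows onto the same machine, which shows up as Exchange
(shuffle) operators in the physical plan.

val orders    = spark.read.parquet("/data/orders")    // hypothetical
val customers = spark.read.parquet("/data/customers") // hypothetical

// Rows sharing a customer_id may start out on different machines, so
// Spark shuffles both sides to bring matching keys together.
val joined = orders.join(customers, "customer_id")

// The physical plan will typically show Exchange (shuffle) nodes.
joined.explain()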

So, the point is that you have to avoid shuffles as much as possible. You
cannot eliminate shuffles altogether, but you can reduce them.
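
One common way to reduce them (my aside, not something from the original
post): if one side of a join is small, hint Spark to broadcast it to every
machine, so the large side never has to move. Continuing the join sketch
above:

import org.apache.spark.sql.functions.broadcast

// Ship a full copy of the small `customers` table to every executor;
// the large `orders` table stays in place, avoiding a big shuffle.
val joinedNoShuffle = orders.join(broadcast(customers), "customer_id")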

Now, /collect/ is the granddaddy of all shuffles. It causes Spark to bring
all the data that it has distributed over the machines into a single
machine. If you call collect on a large table, it's analogous to drinking
from a firehose: you are going to drown. Calling collect on a small table is
fine, because very little data will move.
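
In code, with a hypothetical `largeDf`:

// collect() drags every partition across the network into the single
// driver process; on a big table this can exhaust driver memory.
val everything = largeDf.collect()   // risky: the firehose

// Safer ways to peek at big data: fetch or print only a few rows.
val preview = largeDf.take(20)
largeDf.show(20)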

Usually, it's recommended to run all your aggregations in Spark SQL, and
once the data is boiled down to a size small enough to present to a human,
call collect on it to fetch the result and show it to the user.
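
A sketch of that pattern, reusing the hypothetical `orders` table (the
column names are invented): do the heavy aggregation on the cluster, then
collect only the small summary.

import org.apache.spark.sql.functions.sum

// The grouping and summing run distributed across all machines; only
// the tiny per-country summary comes back to the driver for display.
val summary = orders
  .groupBy("country")
  .agg(sum("amount").as("total_amount"))
  .collect()

summary.foreach(println)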



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-collect-function-tp28644p28647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark-SQL collect function

Posted by Aakash Basu <aa...@gmail.com>.
Well described, thanks!
