Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2021/01/13 16:38:37 UTC

understanding spark shuffle file re-use better

is shuffle file re-use based on identity or equality of the dataframe?

for example, if i run the exact same code twice to load data and do
transforms (joins, aggregations, etc.), but without re-using any actual
dataframes, will i still see skipped stages thanks to shuffle file re-use?
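
in other words, something like this sketch (the path and column names are
just made up for illustration):

```
// run 1: build a dataframe from scratch and trigger a shuffle
val df1 = spark.read.parquet("/data/events").groupBy("user").count()
df1.write.parquet("/tmp/out1")

// run 2: the exact same code, but a brand new dataframe object,
// so the plans are equal while the identities differ
val df2 = spark.read.parquet("/data/events").groupBy("user").count()
df2.write.parquet("/tmp/out2") // is the shuffle from run 1 re-used here?
```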

thanks!
koert

Re: understanding spark shuffle file re-use better

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

An interesting question that, I must admit, I'm not sure how to answer
myself :)

Off the top of my head, I'd **guess** that unless you cache the first query,
these two queries would share nothing. With caching, there's a phase in
query execution when a canonicalized version of a query is used to look up
any cached queries.
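
Something like this (untested) sketch is what I have in mind; with caching,
even a separately-built but structurally equal dataframe should hit the cache:

```
val df1 = spark.read.text("README.md").groupBy("value").count()
df1.cache()
df1.count()   // materializes the cache

// built independently, but its canonicalized plan equals df1's
val df2 = spark.read.text("README.md").groupBy("value").count()
df2.explain() // should show InMemoryTableScan if the cache lookup succeeded
df2.count()   // reads the cached data instead of re-shuffling
```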

Again, I'm not really sure, and if I had to answer it (e.g. as part of an
interview) I'd say nothing would be shared / re-used.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Wed, Jan 13, 2021 at 5:39 PM Koert Kuipers <ko...@tresata.com> wrote:

> is shuffle file re-use based on identity or equality of the dataframe?
>
> for example, if i run the exact same code twice to load data and do
> transforms (joins, aggregations, etc.), but without re-using any actual
> dataframes, will i still see skipped stages thanks to shuffle file re-use?
>
> thanks!
> koert
>

Re: understanding spark shuffle file re-use better

Posted by Mandloi87 <ba...@yahoo.com.INVALID>.
Increase or decrease the number of data partitions: since a data partition
represents the quantum of data to be processed together by a single Spark
task, there could be situations:
 (a) where the existing number of data partitions is not sufficient to
maximize the usage of the available resources,
 (b) where the existing data partitions are too heavy to be computed
reliably without memory overruns, or
 (c) where the existing number of data partitions is so high that the
task-scheduling overhead becomes the bottleneck in the overall processing
time. A rough sketch of the knobs involved follows below.
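
The usual knobs, as an illustrative sketch (the path and the numbers are
made up):

```
val df = spark.read.parquet("/data/events")

// (a) too few partitions: repartition up so all cores get work (full shuffle)
val wider = df.repartition(200)

// (c) too many small partitions: coalesce down without a full shuffle
val narrower = wider.coalesce(50)

// shuffle stages (joins, aggregations) take their partition count from here
spark.conf.set("spark.sql.shuffle.partitions", "200")
```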





Re: understanding spark shuffle file re-use better

Posted by Attila Zsolt Piros <pi...@gmail.com>.
A much better one-liner (easier to follow on the UI because it is one
simple job with 2 stages):

```
spark.read.text("README.md").repartition(2).take(1)
```
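
If you prefer not to eyeball the UI, a rough sketch with a listener can
count the completed stages (skipped stages do not fire this event),
assuming a plain spark-shell session:

```
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

var completedStages = 0
spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    completedStages += 1
})

spark.read.text("README.md").repartition(2).take(1)
spark.read.text("README.md").repartition(2).take(1)

// listener events arrive asynchronously, so give them a moment
Thread.sleep(2000)
println(s"completed stages: $completedStages") // 4: nothing was skipped
```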


Attila Zsolt Piros wrote
> No, it won't be reused.
> You should reuse the dataframe for reusing the shuffle blocks (and cached
> data).
> 
> I know this because the two actions will lead to building two separate
> DAGs, but I will show you a way you can check this on your own (with a
> small, simple Spark application).
> 
> For this you can even use the spark-shell. Start it in a directory where
> a simple text file is available ("README.md" in my case).
> 
> After this the one-liner is:
> 
> ```
> scala> spark.read.text("README.md").selectExpr("length(value) as l",
> "value").groupBy("l").count.take(1)
> ```
> 
> Now if you check the Stages tab on the UI you will see 3 stages.
> After re-executing the same line of code, you can see in the Stages tab
> that the number of stages has doubled.
> 
> So shuffle files are not reused.
> 
> Finally, you can delete the file and re-execute our small test. Now it
> will produce:
> 
> ``` 
> org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
> ```
> 
> So the file would have been opened again to load the data (even in the
> 3rd run).
> 
> 
> 




Re: understanding spark shuffle file re-use better

Posted by Attila Zsolt Piros <pi...@gmail.com>.
No, it won't be reused.
You should reuse the dataframe for reusing the shuffle blocks (and cached
data).

I know this because the two actions will lead to building two separate
DAGs, but I will show you a way you can check this on your own (with a
small, simple Spark application).

For this the spark-shell can be used, too. Start it in a directory where a
simple text file is available ("README.md" in my case).

After this the one-liner is:

```
scala> spark.read.text("README.md").selectExpr("length(value) as l",
"value").groupBy("l").count.take(1)
```

Now if you check the Stages tab on the UI you will see 3 stages.
After re-executing the same line of code, you can see in the Stages tab
that the number of stages has doubled.

So shuffle files are not reused.

Finally, you can delete the file and re-execute our small test. Now it
will produce:

``` 
org.apache.spark.sql.AnalysisException: Path does not exist:
file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
```

So the file would have been opened again to load the data (even in the
3rd run).
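
For contrast, a quick sketch (same spark-shell session assumed): keep a
single dataframe reference and run two actions on it, and the second action
should show the shuffle map stage as "skipped" in the UI, because the same
RDD lineage (and so the same shuffle files) is reused:

```
val df = spark.read.text("README.md").repartition(2)
df.take(1) // first action: both stages run
df.take(1) // second action: the map stage shows up as "skipped" in the UI
```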


