Posted to user@spark.apache.org by V0lleyBallJunki3 <ve...@gmail.com> on 2020/01/19 00:50:13 UTC

Does explode lead to more memory usage

I am using a dataframe that has a structure like this:

root
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true) 
 |    |    |-- amount: double (nullable = true)
 |    |    |-- id: string (nullable = true)
 |-- user: string (nullable = true)
 |-- language: string (nullable = true)

Each user has multiple orders. Now if I explode orders like this:

df.select($"user", $"language", explode($"orders").as("order")). Each order
element will become a row with the user and language duplicated. I was
wondering whether Spark actually materializes each order element as a separate
row in memory, or whether the expansion is only logical. If a single user has
1000 orders, wouldn't that lead to a lot more memory consumption, since user
and language are duplicated 1000 times (once for each order) in memory?
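
To make this concrete, here is a minimal runnable sketch of what I mean (the
sample data and session setup are made up; only the schema matches my post):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    case class Order(amount: Double, id: String)

    // One user with three orders, matching the schema above.
    val df = Seq(
      ("alice", "en", Seq(Order(10.0, "o1"), Order(20.0, "o2"), Order(30.0, "o3")))
    ).toDF("user", "language", "orders")

    val exploded = df.select($"user", $"language", explode($"orders").as("order"))

    // exploded now has 3 rows, and "alice"/"en" appear on every one of them.
    exploded.show(false)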





Re: Does explode lead to more memory usage

Posted by Chris Teoh <ch...@gmail.com>.
I think it does mean more memory usage, but consider how big your arrays are,
and whether arrays make sense for your use case requirements. If the arrays
are very large, it may actually be preferable to explode. Exploding the
arrays will make the data more splittable; keeping the array has the benefit
of avoiding a join and of colocating the child items with their parent, but
it implies more memory pressure on each executor, which has to read every
record in the array, and so requires denser nodes.
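
As a rough illustration of the trade-off (I'm assuming the df and column
names from your post), both layouts can answer the same question, e.g. the
total amount per user:

    import org.apache.spark.sql.functions.{explode, expr, sum}

    // Array layout: aggregate each user's orders in place with a SQL
    // higher-order function; the children stay colocated, no join needed.
    val totalsInPlace = df.select(
      $"user",
      expr("aggregate(orders, 0D, (acc, o) -> acc + o.amount)").as("total")
    )

    // Exploded layout: one row per order. More rows (and repeated user
    // values) in memory, but the data becomes more splittable.
    val totalsExploded = df
      .select($"user", explode($"orders").as("order"))
      .groupBy($"user")
      .agg(sum($"order.amount").as("total"))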

I hope that helps.


Re: Does explode lead to more memory usage

Posted by Chris Teoh <ch...@gmail.com>.
It depends on the use case: if you would otherwise have to join, then having
the data already in an array saves you a join and a shuffle.

If you explode, at least sort within partitions so that you get predicate
pushdown when you read the data next time.
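
Something like this sketch, for example (the path is made up, and I'm
assuming exploded is the result of your explode select):

    // Sorting within partitions clusters the values, so the Parquet
    // row-group min/max statistics become tight and selective.
    exploded
      .sortWithinPartitions($"user")
      .write
      .mode("overwrite")
      .parquet("/tmp/orders_exploded")

    // A filtered read can then skip row groups whose min/max range
    // cannot contain "alice".
    spark.read.parquet("/tmp/orders_exploded")
      .filter($"user" === "alice")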


Re: Does explode lead to more memory usage

Posted by Jörn Franke <jo...@gmail.com>.
Why not use two tables, so that you can join them? That would be the standard way. It depends on what your full use case is, what volumes of orders you expect on average, and what your aggregations and filters look like. Your example suggests that you just do a plain select on the whole table.
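
A rough sketch of the normalized two-table layout (the column names are just
taken from your schema):

    // User attributes are stored once; each order carries the user key.
    val users  = Seq(("alice", "en")).toDF("user", "language")
    val orders = Seq(("alice", 10.0, "o1"), ("alice", 20.0, "o2"))
      .toDF("user", "amount", "id")

    // Join only when a query actually needs both sides.
    val joined = orders.join(users, Seq("user"))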

