Posted to user@spark.apache.org by taozhuo <ta...@gmail.com> on 2016/11/26 02:16:21 UTC

Why is shuffle write size so large when joining Dataset with nested structure?

The Dataset is defined as a case class with many fields, several of which have
nested structure (Map, List of another case class, etc.).
The Dataset is only about 1 TB when saved to disk as a Parquet file, but when I
join it, the shuffle write size grows to as much as 12 TB.
Is there a way to cut this down without changing the schema? If not, what is
the best practice when designing complex schemas?
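
For reference, a simplified sketch of the schema and the join (the class
names, column names, and paths below are made up; the real schema is much
wider):

// Simplified, hypothetical sketch of the kind of schema and join described
// above; run e.g. in spark-shell, where `spark` is the SparkSession.
case class Address(city: String, zip: String)
case class UserRecord(
  id: Long,
  tags: Map[String, String],           // nested Map field
  addresses: List[Address],            // nested List of another case class
  attributes: Map[String, Long])

import spark.implicits._

val left  = spark.read.parquet("/data/users_left").as[UserRecord]   // ~1 TB on disk
val right = spark.read.parquet("/data/users_right").as[UserRecord]

// The equi-join below shuffles both sides; what is written to the shuffle is
// the serialized internal-row form of the data, not the compressed Parquet
// encoding, which is where the 1 TB -> 12 TB blow-up shows up.
val joined = left.joinWith(right, left("id") === right("id"))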



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-when-joining-Dataset-with-nested-structure-tp28136.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Why is shuffle write size so large when joining Dataset with nested structure?

Posted by Zhuo Tao <ta...@gmail.com>.
Hi Takeshi,

Thank you for your comment. I changed the join to use the RDD API and the
shuffle write size is a lot better.
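
Roughly, the change looks like the sketch below (simplified; it reuses the
hypothetical UserRecord and paths from the sketch in my first mail):

// Rough sketch only: join via the RDD API instead of the Dataset API.
// UserRecord and the paths are the made-up ones from the earlier sketch.
import spark.implicits._

val left  = spark.read.parquet("/data/users_left").as[UserRecord]
val right = spark.read.parquet("/data/users_right").as[UserRecord]

// Convert to key-value RDDs and join; the shuffle now carries the configured
// serializer's encoding of the case class objects (Java or Kryo) instead of
// the Dataset's internal rows.
val leftRdd  = left.rdd.map(r => (r.id, r))
val rightRdd = right.rdd.map(r => (r.id, r))

val joinedRdd = leftRdd.join(rightRdd)   // RDD[(Long, (UserRecord, UserRecord))]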

Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> I think this is just the overhead of representing nested elements as
> internal rows at runtime (e.g., null bits are kept for each nested element).
> Moreover, in the Parquet format, nested data are stored column-wise and
> highly compressed, so the on-disk file ends up very compact.
>
> That said, I'm not sure of a better approach in this case.
>
> // maropu
>
> On Sat, Nov 26, 2016 at 11:16 AM, taozhuo <ta...@gmail.com> wrote:
>
>> The Dataset is defined as a case class with many fields, several of which
>> have nested structure (Map, List of another case class, etc.).
>> The Dataset is only about 1 TB when saved to disk as a Parquet file, but
>> when I join it, the shuffle write size grows to as much as 12 TB.
>> Is there a way to cut this down without changing the schema? If not, what
>> is the best practice when designing complex schemas?
>
> --
> ---
> Takeshi Yamamuro
>

Re: Why is shuffle write size so large when joining Dataset with nested structure?

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

I think this is just the overhead of representing nested elements as
internal rows at runtime (e.g., null bits are kept for each nested element).
Moreover, in the Parquet format, nested data are stored column-wise and
highly compressed, so the on-disk file ends up very compact.

That said, I'm not sure of a better approach in this case.
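
FWIW, if you want to see the gap yourself, a rough (and imprecise) check on a
small sample could look like the sketch below; the path and sample size are
made up, and SizeEstimator measures deserialized JVM objects rather than the
exact shuffle bytes:

import org.apache.spark.util.SizeEstimator

// Rough illustration only: compare a small sample's footprint as Parquet on
// disk vs. as deserialized objects on the JVM heap.
val sample = spark.read.parquet("/data/users_left").limit(10000).cache()

// Write the same sample back out and check its size with `hadoop fs -du -s`.
sample.write.mode("overwrite").parquet("/tmp/sample_parquet")

// Estimate of the sample held as row objects in memory; the shuffle writes a
// serialized form, but nested fields still cost far more than their columnar,
// compressed Parquet encoding.
println(s"estimated size as objects: ${SizeEstimator.estimate(sample.collect())} bytes")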

// maropu

On Sat, Nov 26, 2016 at 11:16 AM, taozhuo <ta...@gmail.com> wrote:

> The Dataset is defined as a case class with many fields, several of which
> have nested structure (Map, List of another case class, etc.).
> The Dataset is only about 1 TB when saved to disk as a Parquet file, but
> when I join it, the shuffle write size grows to as much as 12 TB.
> Is there a way to cut this down without changing the schema? If not, what
> is the best practice when designing complex schemas?
>


-- 
---
Takeshi Yamamuro