Posted to user@spark.apache.org by Bahubali Jain <ba...@gmail.com> on 2017/03/16 17:39:43 UTC

Dataset : Issue with Save

Hi,
While saving a dataset using mydataset.write().csv("outputlocation"), I am
running into an exception:



*"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)"*
Does this mean that, in order to save a dataset, the whole of its contents
is sent to the driver, similar to a collect() action?
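
For reference, a minimal sketch of the job (the app name, paths, and the
input source are placeholders; the real dataset is built differently):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SaveJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("csv-save")
                    .getOrCreate();

            // Placeholder input; the real dataset comes from elsewhere.
            Dataset<Row> mydataset = spark.read().parquet("inputlocation");

            // The write runs on the executors; the exception above is thrown
            // on the driver while per-task results are reported back to it.
            mydataset.write().csv("outputlocation");

            spark.stop();
        }
    }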

Thanks,
Baahu

Re: Dataset : Issue with Save

Posted by Yong Zhang <ja...@hotmail.com>.
Looks like the current fix reduces the accumulator data being sent to the driver, but a lot of other statistics data is still sent back. It is arguable how much data is reasonable for 3.7k tasks.
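
One mitigation you could try in the meantime (my own suggestion, not
something from the JIRA): reduce the number of tasks on the write, since
the serialized results grow with the task count. A sketch, with an
illustrative partition count:

    // Fewer tasks means fewer per-task results serialized back to the
    // driver; 200 is an arbitrary example, tune it to your data size.
    mydataset.coalesce(200)
             .write()
             .csv("outputlocation");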


You can attach your heap dump file to that JIRA and follow it.
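
If you have not captured a heap dump yet, the standard JDK jmap tool is one
way to do it (the pid placeholder is your driver JVM's process id):

    # Dump the live objects of the driver JVM to a binary .hprof file.
    jmap -dump:live,format=b,file=driver-heap.hprof <driver-pid>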


Yong

________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 11:41 PM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Dataset : Issue with Save

I am using Spark 2.0. There are comments in the ticket since Oct-2016 which clearly mention that the issue still persists even in 2.0.
I agree 1G is very small in today's world, and I have already resolved this by increasing spark.driver.maxResultSize.
I was more intrigued as to why the data is being sent to the driver during save (similar to a collect() action). Are there any plans to fix this behavior/issue?

Thanks,
Baahu

On Fri, Mar 17, 2017 at 8:17 AM, Yong Zhang <ja...@hotmail.com> wrote:

Did you read the JIRA ticket? Are you confirming that it is fixed in Spark 2.0, or are you complaining that it still exists in Spark 2.0?


First, you didn't tell us what version of Spark you are using. The JIRA clearly says that it is a bug in Spark 1.x that should be fixed in Spark 2.0. So help yourself and the community by confirming whether this is the case.


If you are looking for a workaround, the JIRA ticket clearly shows you how to increase your driver heap. 1G really is kind of small in today's world.


Yong


________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 10:34 PM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Dataset : Issue with Save

Hi,
Has this not been resolved yet?
Saving a dataframe is a very common requirement; is there a better way to save one that avoids data being sent to the driver?

"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) "

Thanks,
Baahu

On Fri, Mar 17, 2017 at 1:19 AM, Yong Zhang <ja...@hotmail.com> wrote:

You can take a look at https://issues.apache.org/jira/browse/SPARK-12837


Yong

Spark driver requires large memory space for serialized ... <https://issues.apache.org/jira/browse/SPARK-12837>
Executing a sql statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver.




________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 1:39 PM
To: user@spark.apache.org
Subject: Dataset : Issue with Save

Hi,
While saving a dataset using mydataset.write().csv("outputlocation"), I am running into an exception:

"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"

Does this mean that, in order to save a dataset, the whole of its contents is sent to the driver, similar to a collect() action?

Thanks,
Baahu



--
Twitter:http://twitter.com/Baahu




--
Twitter:http://twitter.com/Baahu


Re: Dataset : Issue with Save

Posted by Bahubali Jain <ba...@gmail.com>.
I am using Spark 2.0. There are comments in the ticket since Oct-2016
which clearly mention that the issue still persists even in 2.0.
I agree 1G is very small in today's world, and I have already resolved
this by increasing spark.driver.maxResultSize.
I was more intrigued as to why the data is being sent to the driver during
save (similar to a collect() action). Are there any plans to fix this
behavior/issue?
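
For reference, this is roughly what the fix looked like on my side (a
fragment only; the 4g value is illustrative, not a recommendation):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("csv-save")
            // Raise the cap on the total size of serialized task results
            // the driver will accept (the default is 1g).
            .config("spark.driver.maxResultSize", "4g")
            .getOrCreate();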

Thanks,
Baahu

On Fri, Mar 17, 2017 at 8:17 AM, Yong Zhang <ja...@hotmail.com> wrote:

> Did you read the JIRA ticket? Are you confirming that it is fixed in Spark
> 2.0, or are you complaining that it still exists in Spark 2.0?
>
>
> First, you didn't tell us what version of Spark you are using. The JIRA
> clearly says that it is a bug in Spark 1.x that should be fixed in Spark
> 2.0. So help yourself and the community by confirming whether this is the
> case.
>
>
> If you are looking for a workaround, the JIRA ticket clearly shows you how
> to increase your driver heap. 1G really is kind of small in today's world.
>
>
> Yong
>
>
> ------------------------------
> *From:* Bahubali Jain <ba...@gmail.com>
> *Sent:* Thursday, March 16, 2017 10:34 PM
> *To:* Yong Zhang
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dataset : Issue with Save
>
> Hi,
> Has this not been resolved yet?
> Saving a dataframe is a very common requirement; is there a better way
> to save one that avoids data being sent to the driver?
>
>
> * "Total size of serialized results of 3722 tasks (1024.0 MB) is bigger
> than spark.driver.maxResultSize (1024.0 MB) " *
> Thanks,
> Baahu
>
> On Fri, Mar 17, 2017 at 1:19 AM, Yong Zhang <ja...@hotmail.com> wrote:
>
>> You can take a look at https://issues.apache.org/jira/browse/SPARK-12837
>>
>>
>> Yong
>> Spark driver requires large memory space for serialized ...
>> <https://issues.apache.org/jira/browse/SPARK-12837>
>> Executing a sql statement with a large number of partitions requires a
>> high memory space for the driver even when there are no requests to
>> collect data back to the driver.
>>
>>
>>
>> ------------------------------
>> *From:* Bahubali Jain <ba...@gmail.com>
>> *Sent:* Thursday, March 16, 2017 1:39 PM
>> *To:* user@spark.apache.org
>> *Subject:* Dataset : Issue with Save
>>
>> Hi,
>> While saving a dataset using mydataset.write().csv("outputlocation"),
>> I am running into an exception:
>>
>>
>>
>> * "Total size of serialized results of 3722 tasks (1024.0 MB) is bigger
>> than spark.driver.maxResultSize (1024.0 MB)" *
>> Does it mean that for saving a dataset whole of the dataset contents are
>> being sent to driver ,similar to collect()  action?
>>
>> Thanks,
>> Baahu
>>
>
>
>
> --
> Twitter:http://twitter.com/Baahu
>
>


-- 
Twitter:http://twitter.com/Baahu

Re: Dataset : Issue with Save

Posted by Yong Zhang <ja...@hotmail.com>.
Did you read the JIRA ticket? Are you confirming that it is fixed in Spark 2.0, or are you complaining that it still exists in Spark 2.0?


First, you didn't tell us what version of Spark you are using. The JIRA clearly says that it is a bug in Spark 1.x that should be fixed in Spark 2.0. So help yourself and the community by confirming whether this is the case.


If you are looking for a workaround, the JIRA ticket clearly shows you how to increase your driver heap. 1G really is kind of small in today's world.
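
For example, something along these lines when submitting the job (the
memory values, class name, and jar are placeholders, not recommendations):

    spark-submit \
      --driver-memory 4g \
      --conf spark.driver.maxResultSize=2g \
      --class com.example.SaveJob \
      save-job.jar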


Yong


________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 10:34 PM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Dataset : Issue with Save

Hi,
Has this not been resolved yet?
Saving a dataframe is a very common requirement; is there a better way to save one that avoids data being sent to the driver?

"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) "

Thanks,
Baahu

On Fri, Mar 17, 2017 at 1:19 AM, Yong Zhang <ja...@hotmail.com> wrote:

You can take a look at https://issues.apache.org/jira/browse/SPARK-12837


Yong

Spark driver requires large memory space for serialized ... <https://issues.apache.org/jira/browse/SPARK-12837>
Executing a sql statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver.




________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 1:39 PM
To: user@spark.apache.org
Subject: Dataset : Issue with Save

Hi,
While saving a dataset using mydataset.write().csv("outputlocation"), I am running into an exception:

"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"

Does this mean that, in order to save a dataset, the whole of its contents is sent to the driver, similar to a collect() action?

Thanks,
Baahu



--
Twitter:http://twitter.com/Baahu


Re: Dataset : Issue with Save

Posted by Bahubali Jain <ba...@gmail.com>.
Hi,
Has this not been resolved yet?
Saving a dataframe is a very common requirement; is there a better way to
save one that avoids data being sent to the driver?


*"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB) "*
Thanks,
Baahu

On Fri, Mar 17, 2017 at 1:19 AM, Yong Zhang <ja...@hotmail.com> wrote:

> You can take a look at https://issues.apache.org/jira/browse/SPARK-12837
>
>
> Yong
> Spark driver requires large memory space for serialized ...
> <https://issues.apache.org/jira/browse/SPARK-12837>
> Executing a sql statement with a large number of partitions requires a
> high memory space for the driver even when there are no requests to
> collect data back to the driver.
>
>
>
> ------------------------------
> *From:* Bahubali Jain <ba...@gmail.com>
> *Sent:* Thursday, March 16, 2017 1:39 PM
> *To:* user@spark.apache.org
> *Subject:* Dataset : Issue with Save
>
> Hi,
> While saving a dataset using mydataset.write().csv("outputlocation"),
> I am running into an exception:
>
>
>
> * "Total size of serialized results of 3722 tasks (1024.0 MB) is bigger
> than spark.driver.maxResultSize (1024.0 MB)" *
> Does it mean that for saving a dataset whole of the dataset contents are
> being sent to driver ,similar to collect()  action?
>
> Thanks,
> Baahu
>



-- 
Twitter:http://twitter.com/Baahu

Re: Dataset : Issue with Save

Posted by Yong Zhang <ja...@hotmail.com>.
You can take a look at https://issues.apache.org/jira/browse/SPARK-12837


Yong

Spark driver requires large memory space for serialized ... <https://issues.apache.org/jira/browse/SPARK-12837>
Executing a sql statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver.
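
As an aside, and purely my own illustration rather than anything from the
JIRA: for SQL queries the shuffle task count is governed by
spark.sql.shuffle.partitions (default 200), so a large value there is one
way to end up with thousands of tasks.

    // Illustrative only: each shuffle task reports serialized results back
    // to the driver, so the task count drives the total result size.
    spark.conf().set("spark.sql.shuffle.partitions", "200");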




________________________________
From: Bahubali Jain <ba...@gmail.com>
Sent: Thursday, March 16, 2017 1:39 PM
To: user@spark.apache.org
Subject: Dataset : Issue with Save

Hi,
While saving a dataset using mydataset.write().csv("outputlocation"), I am running into an exception:

"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"

Does this mean that, in order to save a dataset, the whole of its contents is sent to the driver, similar to a collect() action?

Thanks,
Baahu