Posted to user@spark.apache.org by Arkadiusz Bicz <ar...@gmail.com> on 2016/01/14 20:31:57 UTC

DataFrameWriter on partitionBy for parquet eats all RAM

Hi

What is the proper configuration for writing a partitioned Parquet
table when the partition key has a large number of repeated values?

In the code below I load 500 million rows of data and partition them
on a column with not many distinct values.

I am using spark-shell with 30g each for the driver and the executors, and 3 executor cores.
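For reference, the launch command is roughly the following (the YARN master
flag is my assumption; the memory and core settings are as stated above), and
the one-liner further below is then run inside the shell:

spark-shell --master yarn \
  --driver-memory 30g \
  --executor-memory 30g \
  --executor-cores 3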

sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")


The job failed because there was not enough memory in the executors:

WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
used. Consider boosting spark.yarn.executor.memoryOverhead.
16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
datanode2.babar.poc: Container killed by YARN for exceeding memory
limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
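The only knob the message itself points at is
spark.yarn.executor.memoryOverhead, which I understand can be raised at launch
time with something like (the value 4096 is just an example):

  --conf spark.yarn.executor.memoryOverhead=4096

but I am not sure simply raising it is the right fix, hence the question about
the proper configuration.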

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: DataFrameWriter on partitionBy for parquet eats all RAM

Posted by Jerry Lam <ch...@gmail.com>.
Hi Michael,

Thanks for sharing the tip. It will help with the write path of the partitioned table.
Do you have a similar suggestion for reading the partitioned table back when there are a million distinct values in the partition field (for example, user id)? Last time I had trouble reading a partitioned table because it took very long (hours on S3) just to execute sqlContext.read.parquet("partitioned_table").
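For concreteness, the slow call is the first line of the sketch below. The
second part is a workaround I have not validated: point the reader at the
specific partition directories and keep the partition column via the basePath
option (the paths, the userid=<value> layout, and the availability of that
option on 1.6 are assumptions):

// Partition discovery over ~1M directories on S3 is what takes hours
val all = sqlContext.read.parquet("s3a://bucket/partitioned_table")

// Untested workaround: read only the partitions that are needed, keeping
// the partition column by declaring the table root as basePath
val some = sqlContext.read
  .option("basePath", "s3a://bucket/partitioned_table")
  .parquet("s3a://bucket/partitioned_table/userid=1",
           "s3a://bucket/partitioned_table/userid=2")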

Best Regards,

Jerry

Sent from my iPhone

> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <mi...@databricks.com> wrote:
> 
> See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546
> 
>> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <ch...@gmail.com> wrote:
>> Hi Arkadiusz,
>> 
>> partitionBy was not designed for many distinct values, at least the last time I used it. If you search the mailing list, I think a couple of people have faced similar issues. For example, in my case it won't work with over a million distinct user ids: it requires a lot of memory and takes a very long time to read the table back.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>>> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <ar...@gmail.com> wrote:
>>> Hi
>>> 
>>> What is the proper configuration for writing a partitioned Parquet
>>> table when the partition key has a large number of repeated values?
>>> 
>>> In the code below I load 500 million rows of data and partition them
>>> on a column with not many distinct values.
>>> 
>>> I am using spark-shell with 30g each for the driver and the executors, and 3 executor cores.
>>> 
>>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>> 
>>> 
>>> The job failed because there was not enough memory in the executors:
>>> 
>>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
>>> 
> 

Re: DataFrameWriter on partitionBy for parquet eats all RAM

Posted by Michael Armbrust <mi...@databricks.com>.
See here for some workarounds:
https://issues.apache.org/jira/browse/SPARK-12546
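Independent of the JIRA (so treat this only as a sketch of the general idea,
not a supported recipe), one common mitigation is to cluster the rows by the
partition column before the write, so each task keeps only a few Parquet
writers open at a time instead of one per distinct value it happens to see:

import org.apache.spark.sql.functions.col

val df = sqlContext.read.load("hdfs://notpartitioneddata")

// Shuffle and sort so each task sees contiguous runs of one partition value,
// which bounds the number of concurrently open Parquet writers per task.
df.repartition(col("columnname"))
  .sortWithinPartitions(col("columnname"))
  .write
  .partitionBy("columnname")
  .parquet("partitioneddata")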

On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Arkadiusz,
>
> partitionBy was not designed for many distinct values, at least the last
> time I used it. If you search the mailing list, I think a couple of people
> have faced similar issues. For example, in my case it won't work with over
> a million distinct user ids: it requires a lot of memory and takes a very
> long time to read the table back.
>
> Best Regards,
>
> Jerry
>
> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <ar...@gmail.com>
> wrote:
>
>> Hi
>>
>> What is the proper configuration for writing a partitioned Parquet
>> table when the partition key has a large number of repeated values?
>>
>> In the code below I load 500 million rows of data and partition them
>> on a column with not many distinct values.
>>
>> I am using spark-shell with 30g each for the driver and the executors, and 3 executor cores.
>>
>>
>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>
>>
>> The job failed because there was not enough memory in the executors:
>>
>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>> spark.yarn.executor.memoryOverhead.
>>
>>
>>
>

Re: DataFrameWriter on partitionBy for parquet eats all RAM

Posted by Jerry Lam <ch...@gmail.com>.
Hi Arkadiusz,

partitionBy was not designed for many distinct values, at least the last
time I used it. If you search the mailing list, I think a couple of people
have faced similar issues. For example, in my case it won't work with over
a million distinct user ids: it requires a lot of memory and takes a very
long time to read the table back.
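If you really need per-user filtering, one pattern that avoids the
million-directory layout (only a sketch; the bucket count, column names and
paths are made up) is to partition by a coarse hash bucket of the user id and
filter on both the bucket and the id when reading:

import org.apache.spark.sql.functions.{col, crc32, lit, pmod}

// Illustrative input; assume it has a string column "user_id"
val df = sqlContext.read.load("hdfs://events")

// Derive a coarse bucket (1024 is arbitrary) so the table ends up with
// ~1k partition directories instead of ~1M
val bucketed = df.withColumn("user_bucket",
  pmod(crc32(col("user_id")), lit(1024)))
bucketed.write.partitionBy("user_bucket").parquet("bucketed_table")

// At read time the bucket predicate prunes directories and the id
// predicate filters rows inside the surviving bucket
val oneUser = sqlContext.read.parquet("bucketed_table")
  .where(col("user_bucket") === pmod(crc32(lit("user42")), lit(1024)))
  .where(col("user_id") === "user42")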

Best Regards,

Jerry

On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <ar...@gmail.com>
wrote:

> Hi
>
> What is the proper configuration for writing a partitioned Parquet
> table when the partition key has a large number of repeated values?
>
> In the code below I load 500 million rows of data and partition them
> on a column with not many distinct values.
>
> I am using spark-shell with 30g each for the driver and the executors, and 3 executor cores.
>
>
> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>
>
> The job failed because there was not enough memory in the executors:
>
> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
> datanode2.babar.poc: Container killed by YARN for exceeding memory
> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
> spark.yarn.executor.memoryOverhead.
>
>
>