You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2015/08/29 00:37:45 UTC

[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

     [ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai reassigned SPARK-10339:
--------------------------------

    Assignee: Yin Huai

> When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10339
>                 URL: https://issues.apache.org/jira/browse/SPARK-10339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>
> I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in driver's old gen gradually decreases and eventually there is pretty much no free space in driver's old gen. Finally, all kinds of timeouts happen and the cluster is died.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test by deleting SQL metrics from project and filter operator, my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan is like
> {code}
>        other operators
>            |
>            |
>         /--|------\
>        /   |       \
>       /    |        \
>      /     |         \
> project  project ... project
>   |        |           |
> filter   filter  ... filter
>   |        |           |
> part1    part2   ... part n
> {code}
> We create SQL metrics for every filter and project, which causing the extremely high memory pressure to the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org