Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/08/30 01:40:45 UTC

[jira] [Resolved] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

     [ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-10339.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 8515
[https://github.com/apache/spark/pull/8515]

> When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10339
>                 URL: https://issues.apache.org/jira/browse/SPARK-10339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free space in the driver's old gen gradually decreases until there is essentially none left. Finally, all kinds of timeouts occur and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => ())
> {code}
> A quick test confirmed the cause: after deleting the SQL metrics from the project and filter operators, my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan is like
> {code}
>        other operators
>            |
>            |
>         /--|------\
>        /   |       \
>       /    |        \
>      /     |         \
> project  project ... project
>   |        |           |
> filter   filter  ... filter
>   |        |           |
> part1    part2   ... part n
> {code}
> We create SQL metrics for every filter and project operator, which causes extremely high memory pressure on the driver.
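Editor's note: the memory blow-up described above can be sketched in a few lines. This is an illustrative model, not Spark's actual internals; `SQLMetric`, `ScanLeaf`, and `planFor` are hypothetical names chosen to mirror the plan diagram in the description.

```scala
// Sketch (assumptions: simplified metric and plan-node types) of why
// per-partition operator metrics multiply driver-side state.
final case class SQLMetric(name: String, var value: Long = 0L)

// One leaf of the expanded plan: a single partition's project + filter pair,
// each operator carrying its own metric accumulator.
final case class ScanLeaf(partition: Int, metrics: Seq[SQLMetric])

// Build one project node and one filter node per partition, mirroring the
// "project / filter / part n" branches in the plan diagram above.
def planFor(numPartitions: Int): Seq[ScanLeaf] =
  (1 to numPartitions).map { p =>
    ScanLeaf(p, Seq(SQLMetric("projectOutputRows"), SQLMetric("filterOutputRows")))
  }

val plan = planFor(5000)
// 5000 partitions x 2 operators = 10000 metric accumulators, all of which
// must live on the driver at once.
println(plan.map(_.metrics.size).sum) // prints 10000
```

With thousands of partitions, the accumulator count grows linearly with partition count, which is why deduplicating or removing the per-operator metrics relieves the driver.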



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
