Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/07/08 20:12:11 UTC

[jira] [Updated] (SPARK-13346) Using DataFrames iteratively leads to slow query planning

     [ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-13346:
--------------------------------------
    Summary: Using DataFrames iteratively leads to slow query planning  (was: Using DataFrames iteratively leads to massive query plans, which slows execution)

> Using DataFrames iteratively leads to slow query planning
> ---------------------------------------------------------
>
>                 Key: SPARK-13346
>                 URL: https://issues.apache.org/jira/browse/SPARK-13346
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows very quickly with each iteration.  Caching the current DataFrame at the end of an iteration does not fix the problem.  However, converting the DataFrame to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to several hundred lines, to several thousand lines, ...) with successive iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the query plan does not need to be computed since it is already cached.  The computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But the problem should be easy to see in any iterative algorithm that produces a new DataFrame from the previous one on each iteration.
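
The growth pattern described in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark code: `Plan`, `step`, `truncate`, and `run` are hypothetical names, and the "derive a column, then join back" iteration is an assumed example of an algorithm that references the previous DataFrame more than once. In PySpark, the RDD round trip from the report would look roughly like `spark.createDataFrame(df.rdd, df.schema)`; here it is modeled as replacing the whole plan with a single leaf node.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy stand-in for a logical plan node (hypothetical, not Spark's class)."""
    op: str
    children: Tuple["Plan", ...] = ()

    def size(self) -> int:
        # Count nodes the way a printed plan lists them: a subtree that is
        # referenced twice is counted twice.
        return 1 + sum(child.size() for child in self.children)


def step(plan: Plan) -> Plan:
    """Assumed iteration: derive something from the input, then join it back."""
    derived = Plan("Project", (plan,))
    return Plan("Join", (plan, derived))


def truncate(plan: Plan) -> Plan:
    """Model of the RDD round trip: the result becomes a leaf with no history."""
    return Plan("ExistingRDD")


def run(iterations: int, truncate_each_step: bool) -> List[int]:
    """Return the plan size the planner would face on each iteration."""
    plan = Plan("Scan")
    sizes = []
    for _ in range(iterations):
        plan = step(plan)
        sizes.append(plan.size())
        if truncate_each_step:
            plan = truncate(plan)
    return sizes


if __name__ == "__main__":
    print("no truncation:  ", run(5, truncate_each_step=False))
    print("with truncation:", run(5, truncate_each_step=True))
```

In this model the plan size roughly doubles on every iteration when the history is kept, and stays constant when the plan is cut back to a leaf after each step. That matches the behavior the report describes for the RDD round trip, though the real growth rate depends on the algorithm.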


