Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/07/27 08:04:00 UTC

[jira] [Resolved] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-39834.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37248
[https://github.com/apache/spark/pull/37248]

> Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39834
>                 URL: https://issues.apache.org/jira/browse/SPARK-39834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.4.0
>
>
> With SPARK-39748, Spark includes the origin logical plan in LogicalRDD if it comes from a DataFrame, in order to carry over stats and to provide information that could connect two disconnected logical plans into one.
> After introducing that change, we found several issues:
> 1. One major use case for DataFrame.checkpoint is ML, especially iterative algorithms, where the purpose of checkpointing is to "prune" the logical plan. Including the origin logical plan works against that purpose, and we risk nested LogicalRDDs that grow the logical plan without bound.
> 2. We rely on the logical plan to carry over stats, but the correct stats information is in the optimized plan.
> 3. (Not an issue, but a missed opportunity) Constraints are also something we can carry over.
> To address the issues above, it would be better to include stats and constraints in LogicalRDD rather than the logical plan.
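
The trade-off described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: Plan, Stats, checkpoint_keep_origin, checkpoint_keep_summary and run_iterations are hypothetical names invented for illustration and do not correspond to Spark's actual classes or to the change in the pull request. The toy also does not model optimization; in Spark the carried-over stats would be the ones from the optimized plan.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Stats:
    """Hypothetical stand-in for plan statistics."""
    size_in_bytes: int
    row_count: Optional[int] = None


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan node; `child` is the plan this node keeps a reference to."""
    name: str
    child: Optional["Plan"] = None
    stats: Optional[Stats] = None
    constraints: FrozenSet[str] = frozenset()

    def depth(self) -> int:
        # Number of nodes reachable through the chain of children.
        return 1 + (self.child.depth() if self.child is not None else 0)


def checkpoint_keep_origin(plan: Plan) -> Plan:
    # Models keeping the whole origin plan inside the checkpointed node.
    return Plan("LogicalRDD", child=plan, stats=plan.stats,
                constraints=plan.constraints)


def checkpoint_keep_summary(plan: Plan) -> Plan:
    # Models keeping only stats and constraints; the origin plan is dropped.
    return Plan("LogicalRDD", stats=plan.stats, constraints=plan.constraints)


def run_iterations(checkpoint: Callable[[Plan], Plan], rounds: int) -> Plan:
    # Models an iterative algorithm: transform, then checkpoint, every round.
    plan = Plan("Scan", stats=Stats(1024, 100),
                constraints=frozenset({"id IS NOT NULL"}))
    for _ in range(rounds):
        step = Plan("Project", child=plan, stats=plan.stats,
                    constraints=plan.constraints)
        plan = checkpoint(step)
    return plan


nested = run_iterations(checkpoint_keep_origin, rounds=3)
pruned = run_iterations(checkpoint_keep_summary, rounds=3)
print(nested.depth(), pruned.depth())
```

In this toy, the variant that keeps the origin plan gains two nodes per round, while the variant that keeps only stats and constraints stays a single leaf and still exposes both pieces of information.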



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
