Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/07/22 03:13:00 UTC

[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39834:
------------------------------------

    Assignee: Apache Spark

> Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39834
>                 URL: https://issues.apache.org/jira/browse/SPARK-39834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan in LogicalRDD when it comes from a DataFrame, both to carry over stats and to provide the information needed to reconnect two disconnected logical plans into one.
> After introducing the change, we found several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially iterative algorithms, where the purpose is to "prune" the logical plan. Retaining the origin logical plan works against that purpose, and risks nested LogicalRDDs that grow the logical plan without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats live in the optimized plan.
> 3. (Not an issue, but a missed spot) Constraints are also something we can carry over.
> To address the above issues, it would be better to include stats and constraints directly in LogicalRDD rather than the origin logical plan.
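
The growth problem in point 1 can be illustrated with a toy sketch. This is not Spark code: `Plan`, `checkpoint_with_origin`, and `checkpoint_with_stats` are hypothetical stand-ins for a logical plan tree, the SPARK-39748 behavior (LogicalRDD retains the whole origin plan), and the proposed behavior (LogicalRDD keeps only stats/constraints), respectively.

```python
# Toy model (not Spark code): why embedding the origin plan in each
# checkpointed LogicalRDD grows the plan across iterations, while
# carrying only stats/constraints keeps it pruned.

class Plan:
    def __init__(self, desc, children=()):
        self.desc = desc
        self.children = list(children)

    def size(self):
        # total number of nodes in the plan tree
        return 1 + sum(c.size() for c in self.children)

def checkpoint_with_origin(plan):
    # SPARK-39748 behavior: the new leaf retains the whole origin plan,
    # so each checkpoint nests the previous iterations' plans.
    return Plan("LogicalRDD(originPlan=...)", [plan])

def checkpoint_with_stats(plan):
    # Proposed behavior: keep only stats/constraints; the origin plan
    # is dropped, so the checkpoint truly prunes the lineage.
    return Plan("LogicalRDD(stats, constraints)")

def iterate(plan, checkpoint, steps=5):
    # one "iterative algorithm" step = add transforms, then checkpoint
    for _ in range(steps):
        plan = Plan("Project", [Plan("Filter", [plan])])
        plan = checkpoint(plan)
    return plan

grown = iterate(Plan("Source"), checkpoint_with_origin)
flat = iterate(Plan("Source"), checkpoint_with_stats)
print(grown.size())  # 16: grows linearly with the number of iterations
print(flat.size())   # 1: checkpoint resets the plan to a single leaf
```

The point of the sketch is only that the first strategy makes plan size a function of iteration count, while the second keeps it constant, which is what "prune the logical plan" requires for iterative ML workloads.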



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org