You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/07/22 03:03:00 UTC

[jira] [Created] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

Jungtaek Lim created SPARK-39834:
------------------------------------

Summary: Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
Key: SPARK-39834
URL: https://issues.apache.org/jira/browse/SPARK-39834
Project: Spark
Issue Type: Improvement
Components: SQL, Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim

With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it comes from DataFrame, to achieve carrying-over stats as well as providing information to possibly connect two disconnected logical plans into one.

After we introduced the change, we figured out several issues:

1. One of major use case for DataFrame.checkpoint is ML, especially "iterative algorithm", which purpose is to "prune" the logical plan. That is against the purpose of including origin logical plan and we have a risk to have nested LogicalRDDs which grows the size of logical plan infinitely.

2. We leverage logical plan to carry over stats, but the correct stats information is in optimized plan.

3. (Not an issue but missing spot) constraints is also something we can carry over.

To address above issues, it would be better if we include stats and constraints in LogicalRDD rather than logical plan.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org