Posted to reviews@spark.apache.org by "HeartSaVioR (via GitHub)" <gi...@apache.org> on 2023/09/06 02:37:14 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request, #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

HeartSaVioR opened a new pull request, #42823:
URL: https://github.com/apache/spark/pull/42823

   ### What changes were proposed in this pull request?
   
   This PR proposes to override `Scan.columnarSupportMode` for DSv2 streaming data sources. None of them supports columnar.
   
   The rationale is explained in the next section.
   
   ### Why are the changes needed?
   
   The default value for `Scan.columnarSupportMode` is `PARTITION_DEFINED`, which requires `inputPartitions` to be called/evaluated. That could be referenced multiple times during planning.
   
   In `MicroBatchScanExec`, we define `inputPartitions` as a lazy val so that `inputPartitions`, which calls `MicroBatchStream.planInputPartitions`, is not evaluated multiple times. But we missed that there is no guarantee the instance will be initialized only once (although the actual execution happens once). For example, `executedPlan` clones the plan (internally the constructor is called to make a deep copy of the node), explain is called internally to build a SQL execution start event, and so on.
   
   I see `MicroBatchStream.planInputPartitions` gets called 4 times per microbatch, which can be concerning if the overhead of planInputPartitions is non-trivial.
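
The effect described above can be sketched with a toy model. This is not Spark code: `ToyStream`, `ToyScanNode`, `planning_calls`, and the string mode names are hypothetical stand-ins, and the number of copies is chosen only to mirror the four calls observed above. The sketch assumes that each plan copy is a fresh instance whose lazy field starts out empty, and that a partition-defined mode has to look at the partitions before it can answer.

```python
from functools import cached_property


class ToyStream:
    """Hypothetical stand-in for a streaming source; counts planning calls."""

    def __init__(self):
        self.plan_calls = 0

    def plan_input_partitions(self):
        # Imagine an expensive metadata lookup here.
        self.plan_calls += 1
        return ["partition-0", "partition-1"]


class ToyScanNode:
    """Hypothetical stand-in for a scan node with a per-instance lazy field."""

    def __init__(self, stream, columnar_mode="PARTITION_DEFINED"):
        self.stream = stream
        self.columnar_mode = columnar_mode

    @cached_property
    def input_partitions(self):
        # Cached per instance only, like a lazy val on a plan node.
        return self.stream.plan_input_partitions()

    def supports_columnar(self):
        if self.columnar_mode == "UNSUPPORTED":
            return False  # answered without touching the partitions
        if self.columnar_mode == "SUPPORTED":
            return True
        # PARTITION_DEFINED: the partitions must be inspected to decide.
        return all(p.endswith("-columnar") for p in self.input_partitions)

    def clone(self):
        # A copy goes through the constructor, so the cache is not carried over.
        return ToyScanNode(self.stream, self.columnar_mode)


def planning_calls(mode, copies=4):
    """Count planning calls when each of `copies` clones checks columnar support."""
    stream = ToyStream()
    node = ToyScanNode(stream, mode)
    for _ in range(copies):
        node = node.clone()
        node.supports_columnar()
    return stream.plan_calls


print(planning_calls("PARTITION_DEFINED"))
print(planning_calls("UNSUPPORTED"))
```

In this model, declaring the mode up front means the support check never needs the partitions, so copying the node no longer triggers extra planning calls. The lazy field still protects against repeated evaluation on a single instance.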
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing UTs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR closed pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources
URL: https://github.com/apache/spark/pull/42823


[GitHub] [spark] HeartSaVioR commented on pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #42823:
URL: https://github.com/apache/spark/pull/42823#issuecomment-1709925128

   Thanks for reviewing! Merging to master. (The last commit fixes a test that broke as a side effect.)


[GitHub] [spark] HeartSaVioR commented on pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #42823:
URL: https://github.com/apache/spark/pull/42823#issuecomment-1707580885

   cc. @cloud-fan 


[GitHub] [spark] HeartSaVioR commented on pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #42823:
URL: https://github.com/apache/spark/pull/42823#issuecomment-1707590753

   cc. @zsxwing @viirya @xuanyuanking as well


[GitHub] [spark] HeartSaVioR commented on pull request #42823: [SPARK-45080][SS] Explicitly call out support for columnar in DSv2 streaming data sources

Posted by "HeartSaVioR (via GitHub)" <gi...@apache.org>.
HeartSaVioR commented on PR #42823:
URL: https://github.com/apache/spark/pull/42823#issuecomment-1708358772

   > status, lastProgress, and recentProgress *** FAILED *** (382 milliseconds)
   
   This seems to happen consistently. Will look into it.

