You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "razajafri (via GitHub)" <gi...@apache.org> on 2024/03/14 18:51:08 UTC

[PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

razajafri opened a new pull request, #45525:
URL: https://github.com/apache/spark/pull/45525

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'common/utils/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   We are proposing to allow the users to have a custom `InMemoryTableScanExec`. To accomplish this we can follow the same convention we followed for `ShuffleExchangeLike` and `BroadcastExchangeLike` 
   
   
   ### Why are the changes needed?
   In the PR added by @ulysses-you, we are wrapping `InMemoryTableScanExec` inside `TableCacheQueryStageExec`. This could potentially cause problems, especially in the RAPIDS Accelerator for Apache Spark, where we replace `InMemoryTableScanExec` with a customized version that has optimizations needed by us. This situation could lead to the loss of benefits from [SPARK-42101](https://issues.apache.org/jira/browse/SPARK-42101) or even result in Spark throwing an Exception.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Ran the existing tests 
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "razajafri (via GitHub)" <gi...@apache.org>.
razajafri commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2004599457

   @cloud-fan @maryannxue can you please take a look at this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on code in PR #45525:
URL: https://github.com/apache/spark/pull/45525#discussion_r1526620660


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala:
##########
@@ -21,18 +21,37 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
+import org.apache.spark.sql.catalyst.plans.logical.Statistics
 import org.apache.spark.sql.catalyst.plans.physical.Partitioning
 import org.apache.spark.sql.columnar.CachedBatch
 import org.apache.spark.sql.execution.{LeafExecNode, SparkPlan, WholeStageCodegenExec}
 import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
 import org.apache.spark.sql.execution.metric.SQLMetrics
 import org.apache.spark.sql.vectorized.ColumnarBatch
 
+trait InMemoryTableScanLike extends LeafExecNode {

Review Comment:
   please add a description - can just be something basic like what ShuffleExchangeLike has



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "asfgit (via GitHub)" <gi...@apache.org>.
asfgit closed pull request #45525: [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality
URL: https://github.com/apache/spark/pull/45525


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2013529448

   merged to master and branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "razajafri (via GitHub)" <gi...@apache.org>.
razajafri commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2000008940

   > Curious why do you need a customized version of `InMemoryTableScanExec` ? Does `CachedBatchSerializer` not enough ? Is it possible to improve the `CachedBatchSerializer` interface if there is a requirement ?
   
   Our version of `InMemoryTableScanExec` only supports columnar input/output, if it sees non-columnar data it throws an Exception. We also have our custom metrics being collected in the GPU version. We also have optimized the `convertCachedBatchToColumnarBatch` when `GpuInMemoryTableScanExec` is being used as opposed to the default one. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.
ulysses-you commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-1998725012

   Curious why do you need a customized version of  `InMemoryTableScanExec` ?  Does `CachedBatchSerializer` not enough ? Is it possible to improve the `CachedBatchSerializer` interface if there is a requirement ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2014332538

   late LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2004634348

   +1 lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45525:
URL: https://github.com/apache/spark/pull/45525#issuecomment-2013611873

   Thank you for fixing the regression.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org