Posted to reviews@spark.apache.org by "yabola (via GitHub)" <gi...@apache.org> on 2023/03/20 16:30:48 UTC

[GitHub] [spark] yabola opened a new pull request, #40495: only test for reading footer within file range

yabola opened a new pull request, #40495:
URL: https://github.com/apache/spark/pull/40495

   Only for testing, please ignore it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yabola commented on a diff in pull request #40495: test for reading footer within file range

Posted by "yabola (via GitHub)" <gi...@apache.org>.
yabola commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142841503


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
     if (aggregation.isEmpty) {
       ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
     } else {
+      val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
       // For aggregate push down, we will get max/min/count from footer statistics.
-      ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @huaxingao Hi~ while reviewing the code I had a question: why not read only the row-group metadata within the current file range here? I feel like there could be a problem here...
   Here is an example in `SpecificParquetRecordReaderBase`:
   https://github.com/apache/spark/blob/61035129a354d0b31c66908106238b12b1f2f7b0/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L96-L102
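   For reference, the range-filtered read in the linked `SpecificParquetRecordReaderBase` lines looks roughly like this (a sketch, not a drop-in patch; `split` is the Hadoop `FileSplit` for the current task):

   ```scala
   import org.apache.hadoop.mapred.FileSplit
   import org.apache.parquet.format.converter.ParquetMetadataConverter
   import org.apache.parquet.hadoop.ParquetFileReader

   // Read the footer with a range filter: only row groups falling inside
   // [split.getStart, split.getStart + split.getLength) are kept in the
   // returned metadata, instead of all row groups in the file.
   val footer = ParquetFileReader.readFooter(
     configuration, file,
     ParquetMetadataConverter.range(split.getStart, split.getStart + split.getLength))
   ```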





[GitHub] [spark] yabola closed pull request #40495: test reading footer within file range

Posted by "yabola (via GitHub)" <gi...@apache.org>.
yabola closed pull request #40495: test reading footer within file range
URL: https://github.com/apache/spark/pull/40495




[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142902856


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
     if (aggregation.isEmpty) {
       ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
     } else {
+      val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
       // For aggregate push down, we will get max/min/count from footer statistics.
-      ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @yabola Thanks for the ping. I was assuming that we always need to read the whole `PartitionedFile`, i.e. the start is always 0 and the length is always the file length. Maybe it's safer to add the range.
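   If the range were added, the two branches would differ only in the metadata filter passed to the footer reader, something like (a sketch, assuming parquet-mr's `ParquetMetadataConverter` filters):

   ```scala
   import org.apache.parquet.format.converter.ParquetMetadataConverter

   // Current behavior: whole-file footer including all row-group statistics.
   ParquetFooterReader.readFooter(conf, filePath, ParquetMetadataConverter.NO_FILTER)

   // Range-restricted variant: keeps only the row groups whose midpoint lies
   // within [file.start, file.start + file.length). If the file is unsplit
   // (start = 0, length = file size), both filters return the same row groups.
   ParquetFooterReader.readFooter(conf, filePath,
     ParquetMetadataConverter.range(file.start, file.start + file.length))
   ```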





[GitHub] [spark] LuciferYang commented on a diff in pull request #40495: test reading footer within file range

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1143464079


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
     if (aggregation.isEmpty) {
       ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
     } else {
+      val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
       // For aggregate push down, we will get max/min/count from footer statistics.
-      ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   https://github.com/apache/spark/blob/ef0a76eeea30fabb04499908b04124464225f5fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala#L59-L63
   
   `isSplitable` is false when `pushedAggregate.isEmpty` is false, so I think @huaxingao is right
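   The linked `ParquetScan` lines are roughly (paraphrased from the reference, not quoted verbatim):

   ```scala
   override def isSplitable(path: Path): Boolean = {
     // When an aggregate is pushed down, only the footer statistics are read,
     // and each task must see the whole file's footer, so the file is not split.
     pushedAggregate.isEmpty
   }
   ```

   So with aggregate push down each `PartitionedFile` covers the whole file, and `NO_FILTER` over the whole file returns the same row groups a range filter would.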







[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142903084


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
     if (aggregation.isEmpty) {
       ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
     } else {
+      val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
       // For aggregate push down, we will get max/min/count from footer statistics.
-      ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @sunchao What do you think?







[GitHub] [spark] yabola commented on a diff in pull request #40495: test reading footer within file range

Posted by "yabola (via GitHub)" <gi...@apache.org>.
yabola commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1143525196


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
     if (aggregation.isEmpty) {
       ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
     } else {
+      val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String])
       // For aggregate push down, we will get max/min/count from footer statistics.
-      ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @LuciferYang @huaxingao oh~ thank you! Yes, using `NO_FILTER` is not a problem here.
   https://github.com/apache/spark/blob/df2e2516188b46537349aa7a5f279de6141c6450/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala#L50-L57
   
    But do you think it would always be safer to use the file range? I have made some related changes and would like to unify them in https://github.com/apache/spark/pull/39950
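   One way the two call sites could be unified (a hypothetical sketch; the helper name and shape are illustrative, not the actual change in #39950):

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.parquet.format.converter.ParquetMetadataConverter
   import org.apache.parquet.hadoop.metadata.ParquetMetadata

   // Hypothetical helper: pick the metadata filter per use case. When row
   // groups are needed, restrict them to the task's file range, which is a
   // no-op for an unsplit file (start = 0, length = file size) but safe if
   // a caller ever passes a split file.
   private def readParquetFooter(
       conf: Configuration,
       file: PartitionedFile,
       skipRowGroups: Boolean): ParquetMetadata = {
     val filter = if (skipRowGroups) {
       ParquetMetadataConverter.SKIP_ROW_GROUPS
     } else {
       ParquetMetadataConverter.range(file.start, file.start + file.length)
     }
     ParquetFooterReader.readFooter(conf, file.toPath, filter)
   }
   ```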




