You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2023/07/26 10:56:54 UTC

[GitHub] [spark] cxzl25 opened a new pull request, #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

cxzl25 opened a new pull request, #42168:
URL: https://github.com/apache/spark/pull/42168

   ### What changes were proposed in this pull request?
   Reuse `OrcTail`
   
   ### Why are the changes needed?
   
   Avoid constructing `OrcTail` repeatedly when reading the same ORC file.
   
   The ORC reader updates the `OrcTail` to the `ReaderOptions`.
   
   org.apache.orc.impl.ReaderImpl#ReaderImpl
   ```java
         OrcTail orcTail = options.getOrcTail();
         if (orcTail == null) {
           tail = extractFileTail(getFileSystem(), path, options.getMaxLength());
           options.orcTail(tail);
         } else {
           checkOrcVersion(path, orcTail.getPostScript());
           tail = orcTail;
   ```
   https://github.com/apache/orc/blob/715f4f8a9956aa71a9360f2d0e04a173acb5c61d/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L567-L573
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   exist UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cxzl25 commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1655006863

   @dongjoon-hyun @LuciferYang 
   
   Please help to review, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1280137914


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -49,6 +50,7 @@ public class OrcColumnarBatchReader extends RecordReader<Void, ColumnarBatch> {
 
   // The capacity of vectorized batch.
   private int capacity;
+  private OrcTail orcTail;

Review Comment:
   really need this field? Maybe we can ref https://github.com/apache/spark/blob/52c1068190803d856959ba563642a3e440cc086c/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L90-L105



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
URL: https://github.com/apache/spark/pull/42168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1318324551


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala:
##########
@@ -163,13 +166,17 @@ case class OrcPartitionReaderFactory(
 
   private def createORCReader(filePath: Path, conf: Configuration): Reader = {

Review Comment:
   How about letting the existing createORCReader method return a (Reader, OrcTail)? Or can we rebuild a usable OrcTail from Reader? For example, new OrcTail(reader.getFileTail, reader.getSerializedFileFooter)?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1659614744

   cc @dongjoon-hyun FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1876562861

   +CC @shardulm94 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1318324551


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala:
##########
@@ -163,13 +166,17 @@ case class OrcPartitionReaderFactory(
 
   private def createORCReader(filePath: Path, conf: Configuration): Reader = {

Review Comment:
   How about letting the existing createORCReader method return Tuple (Reader, OrcTail)? Or can we rebuild a usable OrcTail from Reader? For example, new OrcTail(reader.getFileTail, reader.getSerializedFileFooter)?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1709789709

   friendly ping @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org