You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "cxzl25 (via GitHub)" <gi...@apache.org> on 2023/07/26 10:56:54 UTC
[GitHub] [spark] cxzl25 opened a new pull request, #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
cxzl25 opened a new pull request, #42168:
URL: https://github.com/apache/spark/pull/42168
### What changes were proposed in this pull request?
Reuse `OrcTail`
### Why are the changes needed?
Avoid constructing `OrcTail` repeatedly when reading the same ORC file.
The ORC reader updates the `OrcTail` to the `ReaderOptions`.
org.apache.orc.impl.ReaderImpl#ReaderImpl
```java
OrcTail orcTail = options.getOrcTail();
if (orcTail == null) {
tail = extractFileTail(getFileSystem(), path, options.getMaxLength());
options.orcTail(tail);
} else {
checkOrcVersion(path, orcTail.getPostScript());
tail = orcTail;
```
https://github.com/apache/orc/blob/715f4f8a9956aa71a9360f2d0e04a173acb5c61d/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L567-L573
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
exist UT
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cxzl25 commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1655006863
@dongjoon-hyun @LuciferYang
Please help to review, thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1280137914
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -49,6 +50,7 @@ public class OrcColumnarBatchReader extends RecordReader<Void, ColumnarBatch> {
// The capacity of vectorized batch.
private int capacity;
+ private OrcTail orcTail;
Review Comment:
really need this field? Maybe we can ref https://github.com/apache/spark/blob/52c1068190803d856959ba563642a3e440cc086c/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L90-L105
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader [spark]
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
URL: https://github.com/apache/spark/pull/42168
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1318324551
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala:
##########
@@ -163,13 +166,17 @@ case class OrcPartitionReaderFactory(
private def createORCReader(filePath: Path, conf: Configuration): Reader = {
Review Comment:
How about letting the existing createORCReader method return a (Reader, OrcTail)? Or can we rebuild a usable OrcTail from Reader? For example, new OrcTail(reader.getFileTail, reader.getSerializedFileFooter)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1659614744
cc @dongjoon-hyun FYI
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader [spark]
Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1876562861
+CC @shardulm94
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42168:
URL: https://github.com/apache/spark/pull/42168#discussion_r1318324551
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala:
##########
@@ -163,13 +166,17 @@ case class OrcPartitionReaderFactory(
private def createORCReader(filePath: Path, conf: Configuration): Reader = {
Review Comment:
How about letting the existing createORCReader method return Tuple (Reader, OrcTail)? Or can we rebuild a usable OrcTail from Reader? For example, new OrcTail(reader.getFileTail, reader.getSerializedFileFooter)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #42168: [SPARK-44556][SQL] Reuse `OrcTail` when enable vectorizedReader
Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42168:
URL: https://github.com/apache/spark/pull/42168#issuecomment-1709789709
friendly ping @dongjoon-hyun
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org