You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "cloud-fan (via GitHub)" <gi...@apache.org> on 2024/01/04 14:19:34 UTC

[PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating c… [spark]

cloud-fan opened a new pull request, #44598:
URL: https://github.com/apache/spark/pull/44598

…olumn vectors for the missing column

### What changes were proposed in this pull request?

This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors. This PR fixes it.

### Why are the changes needed?

Try our best to use off-heap memory with off-heap mode.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1442555023


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##########
@@ -152,6 +153,12 @@ class OrcFileFormat
       assert(supportBatch(sparkSession, resultSchema))
     }
 
+    val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) {

Review Comment:
   I moved it outside of the lambda, so that we don't hit NPE by referencing `sqlConf`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1441954433


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -186,7 +188,12 @@ public void initBatch(
         int colId = requestedDataColIds[i];
         // Initialize the missing columns once.
         if (colId == -1) {
-          OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
+          final WritableColumnVector missingCol;

Review Comment:
   Is it possible to use `ConstantColumnVector` for the `missingCol`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should should use ConstantColumnVector for missing columns [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #44598:
URL: https://github.com/apache/spark/pull/44598#issuecomment-1878163361

   ```
   [info] - SPARK-39557 INSERT INTO statements with tables with array defaults *** FAILED *** (448 milliseconds)
   [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 711.0 failed 1 times, most recent failure: Lost task 0.0 in stage 711.0 (TID 965) (localhost executor driver): java.lang.RuntimeException: DataType ARRAY<INT> is not supported in column vectorized reader.
   [info] 	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:96)
   [info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:197)
   [info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:214)
   [info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
   ```
   Seems if using `ConstantColumnVector`, some refactoring is needed for the `ColumnVectorUtils.populate` method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1442465349


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -186,7 +188,12 @@ public void initBatch(
         int colId = requestedDataColIds[i];
         // Initialize the missing columns once.
         if (colId == -1) {
-          OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
+          final WritableColumnVector missingCol;

Review Comment:
   It seems simpler to use `ConstantColumnVector` here. I've updated the PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1441996872


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##########
@@ -196,7 +197,12 @@ class OrcFileFormat
         val taskAttemptContext = new TaskAttemptContextImpl(taskConf, attemptId)
 
         if (enableVectorizedReader) {
-          val batchReader = new OrcColumnarBatchReader(capacity)
+          val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) {

Review Comment:
   ```
   [info] - SPARK-28156: self-join should not miss cached view *** FAILED *** (216 milliseconds)
   [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1888.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1888.0 (TID 2251) (localhost executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.internal.config.ConfigReader.get(String)" because "reader" is null
   [info] 	at org.apache.spark.internal.config.ConfigEntry.readString(ConfigEntry.scala:94)
   [info] 	at org.apache.spark.internal.config.FallbackConfigEntry.readFrom(ConfigEntry.scala:270)
   [info] 	at org.apache.spark.sql.internal.SQLConf.getConf(SQLConf.scala:5573)
   [info] 	at org.apache.spark.sql.internal.SQLConf.offHeapColumnVectorEnabled(SQLConf.scala:5103)
   [info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:200)
   [info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
   ```
   some tests failed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44598:
URL: https://github.com/apache/spark/pull/44598#issuecomment-1878222395

   I changed it back. It seems non-trivial to make `ConstantColumnVector` to support array/struct/map, and we need to create a real vector anyway to keep the data of an array, so we must pass the memory mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1442271738


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:
##########
@@ -196,7 +197,12 @@ class OrcFileFormat
         val taskAttemptContext = new TaskAttemptContextImpl(taskConf, attemptId)
 
         if (enableVectorizedReader) {
-          val batchReader = new OrcColumnarBatchReader(capacity)
+          val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) {

Review Comment:
   Ya, this part fails.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun closed pull request #44598: [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column
URL: https://github.com/apache/spark/pull/44598


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #44598:
URL: https://github.com/apache/spark/pull/44598#issuecomment-1879824562

   Thank you, @cloud-fan and all. 
   Merged to master/3.5/3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44598:
URL: https://github.com/apache/spark/pull/44598#discussion_r1441954433


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -186,7 +188,12 @@ public void initBatch(
         int colId = requestedDataColIds[i];
         // Initialize the missing columns once.
         if (colId == -1) {
-          OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
+          final WritableColumnVector missingCol;

Review Comment:
   Is it possible to use `ConstantColumnVector` for the `missingCol`? This maybe another story.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44598:
URL: https://github.com/apache/spark/pull/44598#issuecomment-1877173142

   cc @viirya @yaooqinn 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org