You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "bersprockets (via GitHub)" <gi...@apache.org> on 2023/09/07 15:47:09 UTC

[GitHub] [spark] bersprockets opened a new pull request, #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

bersprockets opened a new pull request, #42850:
URL: https://github.com/apache/spark/pull/42850

   ### What changes were proposed in this pull request?
   
   Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.
   
   ### Why are the changes needed?
   
   The following query gets incorrect results:
   ```
   drop table if exists t1;
   
   create table t1 using parquet as
   select * from values
   (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
   as (value);
   
   select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
   
   {"f1":[1.0,2.0,3.0],"f2":[0,0,0]}
   
   ```
   The result should be:
   ```
   {"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
   ```
   The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
   ```
   java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
   ...
   row group 1: RC:1 TS:112 OFFSET:4 
   -------------------------------------------------------------------------------------------------------------------------------------------------------
   value:       
   .f1:         
   ..list:      
   ...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
   .f2:         
   ..list:      
   ...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
   
   ```
   The same bug also occurs when field f2 is a map. This PR fixes that case as well.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, except for fixing the correctness issue.
   
   ### How was this patch tested?
   
   New tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712180825

   late LGTM, thanks @bersprockets !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712172218

   Merged to master/3.5/3.4/3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
URL: https://github.com/apache/spark/pull/42850


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712156308

   cc @sunchao , too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1711769350

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a diff in pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary

Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on code in PR #42850:
URL: https://github.com/apache/spark/pull/42850#discussion_r1319309285


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-44805: cast of struct with two arrays") {
+    withTable("tbl") {
+      sql("create table tbl (value struct<f1:array<int>,f2:array<int>>) using parquet")
+      sql("insert into tbl values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))")
+      val df = sql("""select cast(value as struct<f1:array<double>,f2:array<int>>) AS value
+                     |from tbl""".stripMargin)

Review Comment:
   ```suggestion
         val df = sql("select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from tbl")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org