You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "bersprockets (via GitHub)" <gi...@apache.org> on 2023/09/07 15:47:09 UTC
[GitHub] [spark] bersprockets opened a new pull request, #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
bersprockets opened a new pull request, #42850:
URL: https://github.com/apache/spark/pull/42850
### What changes were proposed in this pull request?
Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.
### Why are the changes needed?
The following query gets incorrect results:
```
drop table if exists t1;
create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);
select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}
```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element: INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element: INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.
### Does this PR introduce _any_ user-facing change?
No, except for fixing the correctness issue.
### How was this patch tested?
New tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712180825
late LGTM, thanks @bersprockets !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712172218
Merged to master/3.5/3.4/3.3.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
URL: https://github.com/apache/spark/pull/42850
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1712156308
cc @sunchao , too
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on PR #42850:
URL: https://github.com/apache/spark/pull/42850#issuecomment-1711769350
cc @cloud-fan
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #42850: [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
Posted by "wangyum (via GitHub)" <gi...@apache.org>.
wangyum commented on code in PR #42850:
URL: https://github.com/apache/spark/pull/42850#discussion_r1319309285
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
checkAnswer(sql("select * from tbl"), expected)
}
}
+
+ test("SPARK-44805: cast of struct with two arrays") {
+ withTable("tbl") {
+ sql("create table tbl (value struct<f1:array<int>,f2:array<int>>) using parquet")
+ sql("insert into tbl values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))")
+ val df = sql("""select cast(value as struct<f1:array<double>,f2:array<int>>) AS value
+ |from tbl""".stripMargin)
Review Comment:
```suggestion
val df = sql("select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from tbl")
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org