Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2021/10/26 09:11:00 UTC

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type

    [ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434214#comment-17434214 ] 

Dongjoon Hyun commented on SPARK-37051:
---------------------------------------

Got it. If this also happens on Parquet, we had better drop `ORC` from the JIRA title. I have removed it.
> This scenario also occurs on Parquet.

> The filter operator gets wrong results in char type
> ---------------------------------------------------
>
>                 Key: SPARK-37051
>                 URL: https://issues.apache.org/jira/browse/SPARK-37051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.1, 3.3.0
>         Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>            Reporter: frankli
>            Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> Data is inserted by Hive and queried by Spark.
> I guess that the char(50) type retains trailing blanks after the actual word.
> These blanks affect the boolean value of "x.equals(y)" and lead to wrong results.
> Luckily, the varchar type is OK.
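The padding behavior described above can be illustrated outside Spark. This is a minimal Python sketch (not Spark code) assuming standard CHAR semantics, where values are right-padded with spaces to the declared length:

```python
# A CHAR(50) column stores each value right-padded with spaces
# to 50 characters.
stored = "Music".ljust(50)   # what ends up on disk: 'Music' + 45 spaces
literal = "Music"            # the filter literal as typed in SQL

# Naive byte-wise equality fails because of the trailing blanks...
assert stored != literal

# ...while comparing after stripping the pad (rtrim-style semantics)
# succeeds.
assert stored.rstrip(" ") == literal
```

This is why the comparison either side must agree on whether the pad is part of the value.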
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
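The empty result above can be reproduced in miniature. A hedged Python sketch, using an in-memory list of dicts as a hypothetical stand-in for t2_orc, where column b was written right-padded to its declared CHAR(50) length:

```python
# Hypothetical stand-in for t2_orc: column b is CHAR(50), so the writer
# stored each value right-padded to 50 characters.
rows = [{"a": "a", "b": "b".ljust(50), "c": i} for i in range(1, 6)]

# Filtering with the raw literal 'b' matches nothing, mirroring the
# empty result of `select * from t2_orc where b='b'`.
empty = [r for r in rows if r["b"] == "b"]
assert empty == []

# Padding the literal to the declared CHAR length before comparing
# recovers all five rows.
fixed = [r for r in rows if r["b"] == "b".ljust(50)]
assert len(fixed) == 5
```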
>  
> By the way, Spark's test suites should add more test cases for the char type.
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
>  ReadSchema: struct<i_item_sk:bigint,i_item_id:string,i_rec_start_date:date,i_rec_end_date:date,i_item_desc:string,i_current_price:decimal(7,2),i_wholesale_cost:decimal(7,2),i_brand_id:int,i_brand:string,i_class_id:int,i_class:string,i_category_id:int,i_category:string,i_manufact_id:int,i_manufact:string,i_size:string,i_formulation:string,i_color:string,i_units:string,i_container:string,i_manager_id:int,i_product_name:string>
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
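The plan above shows the pushed-down predicate EqualTo(i_category, 'Music' plus padding). A hedged Python sketch of how such a pushed filter could return no rows if the on-disk values and the literal disagree about padding (the function name is illustrative, not a Spark API):

```python
# The literal Spark pushed down, padded to the declared CHAR(50) length.
padded_literal = "Music".ljust(50)

def pushed_filter_matches(value: str) -> bool:
    # Byte-wise equality, as a file-format reader would evaluate
    # the pushed-down EqualTo predicate.
    return value == padded_literal

# If the file stores the un-padded string, the padded literal
# never matches and the scan yields an empty row set.
assert not pushed_filter_matches("Music")

# If the stored value is padded the same way, the predicate matches.
assert pushed_filter_matches("Music".ljust(50))
```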
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org