Posted to issues@spark.apache.org by "frankli (Jira)" <ji...@apache.org> on 2021/10/19 09:59:00 UTC

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

    [ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430450#comment-17430450 ] 

frankli commented on SPARK-37051:
---------------------------------

It seems to be affected by the right padding.

[SPARK-34192][SQL] [https://github.com/apache/spark/commit/d1177b52304217f4cb86506fd1887ec98879ed16]
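
For illustration, a minimal Scala sketch (hypothetical values only, not Spark internals) of why a right-padded char(50) value no longer passes an exact equals() check against the unpadded literal:

    // Hypothetical values: simulate what a char(50) cell holds once it is right-padded.
    val padded  = "Music".padTo(50, ' ')   // "Music" followed by 45 trailing blanks
    val literal = "Music"

    println(padded.equals(literal))        // false -- trailing blanks break exact equality
    println(padded.trim.equals(literal))   // true  -- equality holds once the padding is stripped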

[~yaoqiang]

> The filter operator gets wrong results in ORC's char type
> ---------------------------------------------------------
>
>                 Key: SPARK-37051
>                 URL: https://issues.apache.org/jira/browse/SPARK-37051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.1, 3.3.0
>         Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>            Reporter: frankli
>            Priority: Critical
>
> When I try the following sample SQL on the TPC-DS data, the filter operator returns an empty row set (shown in the web UI).
> select * from item where i_category = 'Music' limit 100;
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type retains redundant trailing blanks after the actual word.
> This affects the boolean result of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>  
> By the way, Spark's tests should cover more cases for the char type, along the lines sketched below.
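>
> One possible shape for such a test, as a spark-shell sketch only (the table name t2_orc_repro is hypothetical, not taken from this report; assumes a working SparkSession named spark):
>
>   // Sketch of a reproduction; the table name is made up for illustration.
>   spark.sql("CREATE TABLE t2_orc_repro (a STRING, b CHAR(50), c INT) USING ORC")
>   spark.sql("INSERT INTO t2_orc_repro VALUES ('a', 'b', 1)")
>   // Expected: one row back; on the affected versions the char(50) filter returns nothing.
>   spark.sql("SELECT * FROM t2_orc_repro WHERE b = 'b'").show()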
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
> ReadSchema: struct<i_item_sk:bigint,i_item_id:string,i_rec_start_date:date,i_rec_end_date:date,i_item_desc:string,i_current_price:decimal(7,2),i_wholesale_cost:decimal(7,2),i_brand_id:int,i_brand:string,i_class_id:int,i_class:string,i_category_id:int,i_category:string,i_manufact_id:int,i_manufact:string,i_size:string,i_formulation:string,i_color:string,i_units:string,i_container:string,i_manager_id:int,i_product_name:string>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
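>
> The pushed filter above shows the literal already padded out to the char(50) width ('Music' followed by trailing blanks). As a possible workaround sketch (an assumption for illustration, not a fix from this ticket), comparing on the trimmed column avoids depending on that padding:
>
>   // Hypothetical workaround in spark-shell: trim the char column before comparing.
>   spark.sql("SELECT * FROM tpcds_bin_partitioned_orc_2.item WHERE rtrim(i_category) = 'Music' LIMIT 100").show()
>
> Note that wrapping the column in rtrim() would likely prevent this predicate from being pushed down to the ORC reader.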
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org