You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "frankli (Jira)" <ji...@apache.org> on 2021/10/19 09:59:00 UTC
[jira] [Commented] (SPARK-37051) The filter operator gets wrong
results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430450#comment-17430450 ]
frankli commented on SPARK-37051:
---------------------------------
It seems to be affected by the right padding.
[SPARK-34192][SQL] [https://github.com/apache/spark/commit/d1177b52304217f4cb86506fd1887ec98879ed16]
[~yaoqiang]
> The filter operator gets wrong results in ORC's char type
> ---------------------------------------------------------
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type.
> I guest that the char(50) type will remains redundant blanks after the actual word.
> It will affect the boolean value of "x.equals(Y)", and results in wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+--+
> | col_name | data_type | comment |
> +-----------+------------+----------+--+
> | a | string | NULL |
> | b | char(50) | NULL |
> | c | int | NULL |
> +-----------+------------+----------+–+
> >>> select * from t2_orc where a='a';
> +----+----+----+--+
> | a | b | c |
> +----+----+----+--+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +----+----+----+–+
> >>> select * from t2_orc where b='b';
> +----+----+----+--+
> | a | b | c |
> +----+----+----+--+
> +----+----+----+--+
>
> By the way, Spark's tests should add more cases on the char type.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
> +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music )]++++
> ReadSchema: struct<i_item_sk:bigint,i_item_id:string,i_rec_start_date:date,i_rec_end_date:date,i_item_desc:string,i_current_price:decimal(7,2),i_wholesale_cost:decimal(7,2),i_brand_id:int,i_brand:string,i_class_id:int,i_class:string,i_category_id:int,i_category:string,i_manufact_id:int,i_manufact:string,i_size:string,i_formulation:string,i_color:string,i_units:string,i_container:string,i_manager_id:int,i_product_name:string>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music ))+
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org