You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "frankli (Jira)" <ji...@apache.org> on 2021/10/19 03:33:00 UTC
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

    [ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430297#comment-17430297 ] 

frankli commented on SPARK-37051:
---------------------------------

[~dongjoon] Can I trouble you to take a look. Thanks a lot.

> The filter operator gets wrong results in ORC char/varchar types
> ----------------------------------------------------------------
>
>                 Key: SPARK-37051
>                 URL: https://issues.apache.org/jira/browse/SPARK-37051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>            Reporter: frankli
>            Priority: Major
>
> When I try the following sample SQL on  the TPCDS data, the filter operator returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type.
> I guest that the char(50) type will remains redundant blanks after the actual word.
> It will affect the boolean value of  "x.equals(Y)", and results in wrong results.
> By the way, Spark's tests should add more cases on ORC format.
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,+Music         )]+
> ReadSchema: struct<i_item_sk:bigint,i_item_id:string,i_rec_start_date:date,i_rec_end_date:date,i_item_desc:string,i_current_price:decimal(7,2),i_wholesale_cost:decimal(7,2),i_brand_id:int,i_brand:string,i_class_id:int,i_class:string,i_category_id:int,i_category:string,i_manufact_id:int,i_manufact:string,i_size:string,i_formulation:string,i_color:string,i_units:string,i_container:string,i_manager_id:int,i_product_name:string>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org