Posted to issues@spark.apache.org by "Yu Gan (Jira)" <ji...@apache.org> on 2020/08/06 10:39:00 UTC

[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.

    [ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172226#comment-17172226 ] 

Yu Gan edited comment on SPARK-12741 at 8/6/20, 10:38 AM:
----------------------------------------------------------

I came across a similar issue. My SQL is:

select
  p_brand,
  p_size,
  count(ps_suppkey) as supplier_cnt
from
  tpch.partsupp
inner join
  tpch.part
on p_partkey = ps_partkey
group by
  p_brand,
  p_size

The total row counts differ:

dataSet.count() = 1179, dataSet.rdd().count() = 1178


Finally I found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse, a BadRecordException is thrown when a corrupted record is encountered. In PermissiveMode (the default mode), the corrupted record becomes a None row in the result, and that None row is later filtered out. That filtering explains why the two counts disagree.

BTW, this is on Spark 2.4.
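To make the mechanism concrete, here is a minimal Python sketch (not Spark code; function and variable names are illustrative) of how a permissive failure-safe parser can make two counting paths disagree: the parser maps a corrupt record to a None row, and a downstream stage silently drops the None rows before counting.

```python
def failsafe_parse(record):
    """Parse "a,b" into a tuple of ints; return None for a corrupt record,
    mimicking (in spirit) what PERMISSIVE mode does instead of raising."""
    try:
        a, b = record.split(",")
        return (int(a), int(b))
    except ValueError:
        return None  # corrupt record survives only as a None placeholder

records = ["1,2", "3,4", "oops", "5,6"]
parsed = [failsafe_parse(r) for r in records]

# Path 1: count every emitted row, None placeholders included.
count_with_placeholder = len(parsed)

# Path 2: a later stage filters out the None rows before counting.
count_after_filter = len([row for row in parsed if row is not None])

print(count_with_placeholder, count_after_filter)  # 4 vs 3
```

In the real issue, dataSet.count() and dataSet.rdd().count() correspond to two different evaluation paths over the same data, and only one of them sees the filtered result.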



> DataFrame count method return wrong size.
> -----------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>            Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and two methods, one to collect data and one to count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) No data exists in my NoSQLDatabase: count() returns 0 and collect() returns 0.
> 2) 3 rows exist: count() returns 0 and collect() returns 3.
> 3) 5 rows exist: count() returns 2 and collect() returns 5.
> I tried changing the count code to the snippet below, but got the same results as above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org