Posted to issues@spark.apache.org by "Jeffrey (JIRA)" <ji...@apache.org> on 2019/01/30 06:00:00 UTC

[jira] [Comment Edited] (SPARK-25420) Dataset.count() every time is different.

    [ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755709#comment-16755709 ] 

Jeffrey edited comment on SPARK-25420 at 1/30/19 5:59 AM:
----------------------------------------------------------

[~kabhwan] I cannot share the dataset since it is owned by my clients, but I can elaborate on the scenario:

 
 >>drkcard_0_df = spark.read.csv("wasbs://etl@okprodstorage.blob.core.windows.net/AAA/WICTW/raw/TXN/POS/2018/09/**/*.gz")
  
 The dataset contains more than 100,000 records.
  
 >>drkcard_0_df.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)
|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|James|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|Laurence|

 
 >>dropDup_0=drkcard_0_df.dropDuplicates(["_c0","_c1","_c2","_c3","_c4"])
  
 Running the where for the 1st time: {color:#ff0000}Bug: Group C was mistakenly dropped.{color}
 >>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)
  
|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mary|

 
 Running the where again: {color:#ff0000}Acceptable result{color}
  
  >>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)
  
|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|A|1809192127320082002000018|John|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|C|1809192127320082002000018|Laurence|

 
 Running the where again: {color:#ff0000}Bug: Groups A and C were mistakenly dropped.{color}
  >>dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and _c2='83' and _c4='1809192127320082002000018'").show(2000,False)
|_c0|_c1|_c2|_c3|_c4|_c5|
|2018-09-21 00:00:00|TDT_DSC_ITM|83|B|1809192127320082002000018|Mabel|

 

Repeatedly re-running the where statement, the number of rows keeps changing.
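This behaviour is consistent with dropDuplicates keeping an arbitrary row within each duplicate group: Spark gives no guarantee which duplicate survives, and each action re-executes the non-deterministically partitioned plan, so a different row can survive each run. A deterministic workaround is to pick an explicit winner per group, e.g. the smallest _c5. A minimal plain-Python sketch of that rule (not the reporter's actual pipeline), using the rows from the example above keyed by _c3:

```python
# Deterministic "keep one row per key" rule: for each duplicate group
# (here keyed by _c3), always keep the row with the smallest _c5.
rows = [
    ("A", "John"), ("A", "Tom"),
    ("B", "Mary"), ("B", "Mabel"),
    ("C", "James"), ("C", "Laurence"),
]

survivors = {}
for key, c5 in rows:
    # Keep the lexicographically smallest _c5 seen so far for this key.
    if key not in survivors or c5 < survivors[key]:
        survivors[key] = c5

print(survivors)  # {'A': 'John', 'B': 'Mabel', 'C': 'James'}
```

In Spark the same rule can be expressed with groupBy on the key columns plus agg(min("_c5")), or a Window partitioned by the keys and ordered by _c5 with row_number() == 1; either way the same rows come back on every run.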



> Dataset.count()  every time is different.
> -----------------------------------------
>
>                 Key: SPARK-25420
>                 URL: https://issues.apache.org/jira/browse/SPARK-25420
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>         Environment: spark2.3
> standalone
>            Reporter: huanghuai
>            Priority: Major
>              Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv").option("sep", ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new String[]{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.666666 and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> ------------------------------------------------------ The above is the code ---------------------------------------
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK!
>  
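The shifting filter counts reported above are what one would expect if dropDuplicates keeps a different survivor per duplicate group on each run, and the downstream filter predicate then matches a varying subset. A toy illustration (hypothetical keys and status values, not the reporter's data):

```python
# Two equally valid de-duplications of the same input: dropDuplicates may
# return either one on different runs, because it keeps an arbitrary row
# per duplicate group.
run_a = {"k1": 0, "k2": 1, "k3": 7}  # survivor's status per key, run 1
run_b = {"k1": 5, "k2": 1, "k3": 7}  # a different but equally valid run 2

# The downstream filter (status = 0 or status = 1) then counts a
# different number of rows, even though the input data never changed.
count_a = sum(1 for s in run_a.values() if s in (0, 1))
count_b = sum(1 for s in run_b.values() if s in (0, 1))
print(count_a, count_b)  # 2 1
```

Persisting the de-duplicated Dataset (dropDuplicates.persist() followed by one count to materialise it) should pin a single set of survivors, so subsequent counts agree.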



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org