Posted to issues@spark.apache.org by "Rick Moritz (JIRA)" <ji...@apache.org> on 2015/06/15 18:20:00 UTC

[jira] [Created] (SPARK-8380) SparkR mis-counts

Rick Moritz created SPARK-8380:
----------------------------------

             Summary: SparkR mis-counts
                 Key: SPARK-8380
                 URL: https://issues.apache.org/jira/browse/SPARK-8380
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 1.4.0
            Reporter: Rick Moritz


On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entire dataset and get the correct value, as double-checked against the same code in Scala.
When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies.

In particular, values go missing in SparkR, and massively so:
A top-6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than what I get via Scala.

The following logic, which I consider equivalent, is the basis for this report:

counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
head(arrange(counts, desc(counts$count)))

versus:

val table = sql("SELECT col_name, count(col_name) AS value FROM df GROUP BY col_name ORDER BY value DESC")

The first snippet, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented as far as I can see, I'd hope it does what the programming guide indicates; in that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, the missing documentation and a badly worded example in the guide may be behind my misperception of SparkR's functionality.
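
For completeness, the full SparkR session I have in mind looks roughly as follows. Table and column names are placeholders, and I assume that tdf in the snippet above refers to the same DataFrame as df:

library(SparkR)

# Assumes a working Spark 1.4.0 installation with Hive support.
sc <- sparkR.init()
hiveContext <- sparkRHive.init(sc)

# Load the Hive-backed table into a SparkR DataFrame (placeholder table name).
df <- sql(hiveContext, "SELECT * FROM some_table")

# Group by the feature column, count its occurrences per group,
# and show the six largest groups.
counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
head(arrange(counts, desc(counts$count)))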



