Posted to issues@spark.apache.org by "Yanbo Liang (JIRA)" <ji...@apache.org> on 2014/12/26 16:00:27 UTC

[jira] [Commented] (SPARK-4963) SchemaRDD.sample may return wrong results

    [ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259098#comment-14259098 ] 

Yanbo Liang commented on SPARK-4963:
------------------------------------

I also hit this bug.
In my experiment, the last row of each partition is collected whether or not it passes the predicate in the Filter operator.
A further clue is the differing iterator behavior of GenericRow and SpecificMutableRow.
I'm working on this issue. [~lian cheng], could you assign it to me? I will try to submit a fix.
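
To illustrate the suspected mechanism, here is a minimal, self-contained Scala sketch of the object-reuse pitfall. This is not Spark code: {{MutableRow}} below is a hypothetical stand-in for {{SpecificMutableRow}}, whose single instance is mutated in place for every element a scan produces.

{code}
// Hypothetical stand-in for SpecificMutableRow: one instance is reused
// and mutated in place for every element the scan produces.
class MutableRow(var key: Int)

object ReusePitfall extends App {
  val reused = new MutableRow(0)

  // Simulates a table scan that returns the same row object each time.
  val scan: Iterator[MutableRow] = (1 to 5).iterator.map { k =>
    reused.key = k
    reused
  }

  // A consumer that buffers references (as collect() or a sampler might)
  // ends up with five pointers to one object, all showing the last value.
  val buffered = scan.toArray
  println(buffered.map(_.key).mkString(", ")) // prints: 5, 5, 5, 5, 5
}
{code}

Copying each row before it is buffered (Spark's row types expose a {{copy()}} method for exactly this reason) would avoid the problem.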

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when the fraction is less than 0.4, {{GapSamplingIterator}} is used to do the sampling. My guess is that it has something to do with the underlying mutable row objects used in {{HiveTableScan}}, but I haven't figured out the root cause.
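
For context, gap sampling avoids a per-element random draw: it samples a geometrically distributed gap and skips that many elements of the underlying iterator in one go, which is why small fractions take this faster path. Below is a simplified sketch of the idea, not Spark's actual {{GapSamplingIterator}}; the class name and structure are illustrative only.

{code}
import scala.util.Random

// Simplified gap sampler: instead of a Bernoulli trial per element, draw a
// Geometric(fraction) gap and skip that many elements at once.
class GapSampler[T](underlying: Iterator[T], fraction: Double,
                    rng: Random = new Random) extends Iterator[T] {

  // Number of elements to skip before the next accepted one.
  private def nextGap(): Int =
    (math.log(1.0 - rng.nextDouble()) / math.log(1.0 - fraction)).toInt

  private def advance(): Unit = {
    var toSkip = nextGap()
    while (toSkip > 0 && underlying.hasNext) { underlying.next(); toSkip -= 1 }
  }

  advance() // position on the first accepted element

  override def hasNext: Boolean = underlying.hasNext

  override def next(): T = {
    val elem = underlying.next()
    advance() // keeps consuming the underlying iterator after grabbing elem
    elem
  }
}
{code}

Note that {{next()}} keeps pulling from the underlying iterator (inside {{advance()}}) after it has grabbed {{elem}}. If the scan reuses a single mutable row, {{elem}} is silently overwritten with the contents of a skipped row before the caller reads it, which would be consistent with the wrong results reported here.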


