Posted to issues@spark.apache.org by "Takeshi Yamamuro (JIRA)" <ji...@apache.org> on 2017/07/31 10:09:00 UTC

[jira] [Commented] (SPARK-21581) Spark 2.x distinct return incorrect result

    [ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107081#comment-16107081 ] 

Takeshi Yamamuro commented on SPARK-21581:
------------------------------------------

This is expected behaviour in Spark 2.x; what changed is how corrupt records are handled. With the inferred schema the salary field comes out as double, so the five records whose salary is an empty string fail to parse and become identical all-null rows; distinct then collapses them into one, giving 5 valid rows + 1 null row = 6.

{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+-------+------+--------------------+
|   name|salary|                 url|
+-------+------+--------------------+
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+-------+------+--------------------+

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed STRING").option("columnNameOfCorruptRecord", "_malformed").json("/Users/maropu/Desktop/simple.json").show
+------+------+--------------------+--------------------+
|  name|salary|                 url|          _malformed|
+------+------+--------------------+--------------------+
|staff1| 600.0|http://example.ho...|                null|
|staff2| 700.0|http://example.ho...|                null|
|staff3| 800.0|http://example.ho...|                null|
|staff4| 900.0|http://example.ho...|                null|
|staff5|1000.0|http://example.ho...|                null|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
+------+------+--------------------+--------------------+
{code}
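
If you need the Spark 1.6-style result of 10 distinct rows, two possible workarounds (a rough sketch against the file above, using the DDL-string schema form from the master example; on older 2.x builds you may need to build a StructType instead) are to read salary as a string, or to keep a corrupt-record column in the schema so the malformed rows stay distinct:

{code}
// Sketch of possible workarounds (assuming the same simple.json as above)

// 1. Read salary as a string so the mixed-type values parse instead of becoming corrupt records
spark.read.schema("name STRING, salary STRING, url STRING")
  .json("simple.json")
  .distinct.count   // should be 10: every row parses and each row is unique

// 2. Keep a corrupt-record column in the schema; the malformed rows then differ in _malformed,
//    so distinct no longer collapses them
spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed STRING")
  .option("columnNameOfCorruptRecord", "_malformed")
  .json("simple.json")
  .distinct.count   // should be 10: each corrupt row keeps its own raw record text
{code}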

> Spark 2.x distinct return incorrect result
> ------------------------------------------
>
>                 Key: SPARK-21581
>                 URL: https://issues.apache.org/jira/browse/SPARK-21581
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: shengyao piao
>
> Hi all,
> I'm using Spark 2.x on CDH 5.11.
> I have a JSON file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> And I try to read this file and run distinct.
> ・Spark code
> {code}
> val s = spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6    <- It should be 10
> {code}
> I know the cause of the incorrect result is the mixed types in the salary field.
> But when I try the same code in Spark 1.6, the result is 10.
> So I think it's a bug in Spark 2.x.
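
For reference, a minimal sketch of the behaviour the description above reports (assuming the sample.json listed there): the five empty-string salaries fail to parse against the inferred schema and become identical all-null rows, which distinct collapses.

{code}
// Sketch: reproducing the reported observation (assuming sample.json above)
val s = spark.read.json("sample.json")
s.printSchema      // salary should be inferred as double ("" is treated as null during inference)
s.distinct.show    // 5 distinct valid rows plus a single all-null row
s.distinct.count   // 6, as reported
{code}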


