Posted to issues@spark.apache.org by "Anand Nalya (JIRA)" <ji...@apache.org> on 2016/03/08 13:04:40 UTC

[jira] [Comment Edited] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF

    [ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184828#comment-15184828 ] 

Anand Nalya edited comment on SPARK-13301 at 3/8/16 12:03 PM:
--------------------------------------------------------------

I also encountered the same issue running Spark 1.5.0 on CDH 5.5.1. Here {{dummy}} is just a substring of the {{eventname}} column, and row 4 is incorrect.
{code}
from pyspark.sql import HiveContext
from pyspark.sql.types import StringType

# UDF that drops the first five characters of its input
def _dummy(strng):
    return strng[5:]

sqlCtx = HiveContext(sc)
sqlCtx.registerFunction('dummy', _dummy, StringType())
data = sqlCtx.sql('select eventname, dummy(eventname) as dummy from reader_events limit 100')
data.show(10, False)


+----------------------------------------+-----------------------------------+
|eventname                               |dummy                              |
+----------------------------------------+-----------------------------------+
|calypso_reader_infinite_scroll_performed|so_reader_infinite_scroll_performed|
|calypso_reader_infinite_scroll_performed|so_reader_infinite_scroll_performed|
|calypso_reader_infinite_scroll_performed|so_reader_infinite_scroll_performed|
|calypso_reader_discover_viewed          |so_reader_article_opened           |
|calypso_reader_article_liked            |so_reader_article_liked            |
|calypso_reader_article_liked            |so_reader_article_liked            |
|calypso_reader_infinite_scroll_performed|so_reader_infinite_scroll_performed|
|calypso_reader_article_opened           |so_reader_article_opened           |
|calypso_reader_article_commented_on     |so_reader_article_commented_on     |
|calypso_reader_article_opened           |so_reader_article_opened           |
+----------------------------------------+-----------------------------------+
only showing top 10 rows
{code}
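
As a sanity check, the issue description below notes that Spark built-in functions do not show the problem. A minimal sketch of the same query using Hive's built-in {{substr}}, assuming the same {{reader_events}} table:
{code}
# Hive's substr(str, pos) is 1-based, so substr(eventname, 6)
# matches the Python slice eventname[5:] used by the UDF above.
check = sqlCtx.sql('select eventname, substr(eventname, 6) as dummy from reader_events limit 100')
check.show(10, False)
{code}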



> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>
>                 Key: SPARK-13301
>                 URL: https://issues.apache.org/jira/browse/SPARK-13301
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: PySpark in yarn-client mode - CDH 5.5.1
>            Reporter: Simone
>            Priority: Critical
>
> Using a user-defined function inside the withColumn() method of a PySpark DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> +--------------------+-----------+--------------------+
> |                col1|       col2|                col3|
> +--------------------+-----------+--------------------+
> |1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> +--------------------+-----------+--------------------+
> The results are wrong and seem to be random: some records are OK (for example, the third), while others are not (for example, the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results seem to be correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This is only a partial workaround, because you have to list all the columns of your DataFrame each time.
> The problem also does not seem to occur in Scala/Java.
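
A minimal sketch of the partial workaround described in the report above, listing the columns explicitly instead of using withColumn(); {{myDF}}, {{col1}}, and {{col2}} are the reporter's placeholder names:
{code}
from pyspark.sql import functions
import string

# myDF, col1, col2 are the reporter's placeholders, not real objects here.
myFunc = functions.udf(lambda s: string.lower(s))

# Reported to give wrong, seemingly random results:
#   myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()

# Reported partial workaround: select every column explicitly.
myDF.select("col1", "col2", myFunc(myDF["col1"]).alias("col3")).show()
{code}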



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org