Posted to issues@spark.apache.org by "Luis Guerra (JIRA)" <ji...@apache.org> on 2015/07/17 09:41:10 UTC
[jira] [Created] (SPARK-9131) UDF change data values

Luis Guerra created SPARK-9131:
----------------------------------

             Summary: UDF change data values
                 Key: SPARK-9131
                 URL: https://issues.apache.org/jira/browse/SPARK-9131
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.4.0
         Environment: Pyspark 1.4, Redhat 6.6
            Reporter: Luis Guerra
            Priority: Critical


I am having trouble using a custom udf on dataframes with pyspark 1.4.

I rewrote the udf to simplify the problem, and the behaviour got even weirder. The udfs I am using now do absolutely nothing: they receive a value and return the same value in the same format.

My code is below:

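from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

# Inner join of the two source dataframes on their ID columns
c = a.join(b, a['ID'] == b['ID_new'], 'inner')

c.filter(c['ID'] == 'XX').show()

# Identity udfs: each returns its input unchanged, declared as DateType
udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

# 'ta' is t1 taken directly; 'tb', 'tc' and 'td' pass through the udfs.
# The snippet as posted referenced vinc_muestra here, presumably the joined dataframe c.
d = c.select(c['ID'], c['t1'].alias('ta'), udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'), udf_C(c['t2']).alias('td'))

d.filter(d['ID'] == 'XX').show()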

Here are the outputs of the two show() calls, first for c and then for d:

+----------------+----------------+----------+----------+
|              ID|          ID_new|        t1|        t2|
+----------------+----------------+----------+----------+
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
+----------------+----------------+----------+----------+

+----------------+----------+----------+----------+----------+
|              ID|        ta|        tb|        tc|        td|
+----------------+----------+----------+----------+----------+
|6000000002698917|2012-02-28|2007-03-05|2003-03-05|  20140228|
|6000000002698917|2012-02-20|2007-02-15|  20020215|  20130220|
|6000000002698917|2012-02-28|2007-03-10|  20050310|  20140228|
|6000000002698917|2012-02-20|  20070305|2003-03-05|  20130220|
|6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|6000000002698917|2012-02-28|2007-02-15|  20020215|2014-02-28|
|6000000002698917|2012-02-28|  20070215|2002-02-15|2014-02-28|
|6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+----------------+----------+----------+----------+----------+

The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the 't1' and 't2' values in dataframe 'c', even though my udfs do nothing. It looks as if the values were somehow taken from other records (or just invented). The results also differ between executions, apparently at random.
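
Since 'ta' and 'tc' both come from 't1', and 'tb' and 'td' both come from 't2', every row of 'd' should satisfy ta == tc and tb == td regardless of row order. The plain-Python sketch below checks that invariant on rows represented as dicts. The helper name find_inconsistent_rows is hypothetical, and the sample rows are copied by hand from the tables above as strings; with a real dataframe the rows would presumably come from something like d.collect().

```python
def find_inconsistent_rows(rows):
    """Return the rows that violate the identity-udf invariants.

    'ta' and 'tc' both derive from t1, and 'tb' and 'td' both derive
    from t2, so within a single row ta == tc and tb == td should hold.
    """
    return [r for r in rows if r["ta"] != r["tc"] or r["tb"] != r["td"]]


# Two rows copied from the second output table (dataframe d), as strings.
reported = [
    {"ID": "6000000002698917", "ta": "2012-02-28", "tb": "2007-03-05",
     "tc": "2003-03-05", "td": "20140228"},
    {"ID": "6000000002698917", "ta": "2012-02-20", "tb": "2013-08-02",
     "tc": "2013-01-02", "td": "2013-02-20"},
]

# What a correct row would look like, using t1/t2 values from the first table.
expected = [
    {"ID": "6000000002698917", "ta": "2012-02-28", "tb": "2014-02-28",
     "tc": "2012-02-28", "td": "2014-02-28"},
]

bad = find_inconsistent_rows(reported + expected)
print(len(bad), "inconsistent rows out of", len(reported + expected))
```

Both sampled rows from the reported output fail the check, while the row built from the first table passes, which matches the behaviour described above.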

Thanks in advance


