Posted to issues@spark.apache.org by "Luis Guerra (JIRA)" <ji...@apache.org> on 2015/07/17 09:42:06 UTC
[jira] [Updated] (SPARK-9131) UDF change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luis Guerra updated SPARK-9131:
-------------------------------
Target Version/s: (was: 1.4.2)
> UDF change data values
> ----------------------
>
> Key: SPARK-9131
> URL: https://issues.apache.org/jira/browse/SPARK-9131
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.4.0
> Environment: Pyspark 1.4, Redhat 6.6
> Reporter: Luis Guerra
> Priority: Critical
>
> I am having trouble using a custom udf on dataframes with PySpark 1.4.
> I have rewritten the udf to simplify the problem, and it gets even weirder. The udfs I am using do absolutely nothing: they just receive a value and return the same value in the same format.
> My code is below:
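> from pyspark.sql.functions import UserDefinedFunction
> from pyspark.sql.types import DateType
>
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
> c.filter(c['ID'] == 'XX').show()
> # Identity udfs: each one returns its input unchanged.
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
> # Note: the original post wrote vinc_muestra[...] here, apparently the un-simplified name of c.
> d = c.select(c['ID'], c['t1'].alias('ta'), udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'), udf_C(c['t2']).alias('td'))
> d.filter(d['ID'] == 'XX').show()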
> The output of the two show() calls is:
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
> +----------------+---------------+---------------+------------+------------+
> |              ID|             ta|             tb|          tc|          td|
> +----------------+---------------+---------------+------------+------------+
> |6000000002698917|     2012-02-28|     2007-03-05|  2003-03-05|    20140228|
> |6000000002698917|     2012-02-20|     2007-02-15|    20020215|    20130220|
> |6000000002698917|     2012-02-28|     2007-03-10|    20050310|    20140228|
> |6000000002698917|     2012-02-20|       20070305|  2003-03-05|    20130220|
> |6000000002698917|     2012-02-20|     2013-08-02|  2013-01-02|  2013-02-20|
> |6000000002698917|     2012-02-28|     2007-02-15|    20020215|  2014-02-28|
> |6000000002698917|     2012-02-28|       20070215|  2002-02-15|  2014-02-28|
> |6000000002698917|     2012-02-20|     2014-01-02|  2013-01-02|  2013-02-20|
> +----------------+---------------+---------------+------------+------------+
> The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the 't1' and 't2' values in dataframe 'c', even though my udfs do nothing. It seems as if the values were somehow taken from other records (or just invented). The results differ between executions (apparently at random).
> Thanks in advance
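
One way to quantify the mismatch described in the report is to bring both result sets to the driver and compare them cell by cell. The following is only a minimal sketch in plain Python, not part of the original report: the helper name find_mismatches, the dict-per-row representation, and the assumption that both row lists are in the same order are all hypothetical. The sample values are the fifth row of each table above, typed as dates for illustration.

```python
from datetime import date


def find_mismatches(source_rows, udf_rows, column_pairs):
    """Return (row_index, source_col, udf_col, expected, actual) for each differing cell.

    source_rows / udf_rows: lists of dicts, ASSUMED to be in the same row order.
    column_pairs: (source_col, udf_col) pairs that an identity udf should leave equal.
    """
    mismatches = []
    for i, (src, out) in enumerate(zip(source_rows, udf_rows)):
        for src_col, udf_col in column_pairs:
            if src[src_col] != out[udf_col]:
                mismatches.append((i, src_col, udf_col, src[src_col], out[udf_col]))
    return mismatches


# Fifth row of dataframe c and of dataframe d, as shown in the report.
c_rows = [{"ID": "6000000002698917",
           "t1": date(2012, 2, 20), "t2": date(2013, 2, 20)}]
d_rows = [{"ID": "6000000002698917", "ta": date(2012, 2, 20),
           "tb": date(2013, 8, 2), "tc": date(2013, 1, 2), "td": date(2013, 2, 20)}]

# Which source column each output column was derived from in the select().
pairs = [("t1", "ta"), ("t2", "tb"), ("t1", "tc"), ("t2", "td")]

for mismatch in find_mismatches(c_rows, d_rows, pairs):
    print(mismatch)
```

For this row the sketch flags 'tb' and 'tc', while 'ta' (no udf) and 'td' agree with their source columns. With real DataFrames the rows would first have to be collected and converted to dicts, and because a join does not guarantee row order, both sides would need a deterministic sort before the comparison is meaningful.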