Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/04/25 00:22:49 UTC

udf that handles null values

hi all,

I tried to write a UDF that handles null values:

def getMinutes(hString, minString):
    if (hString != None) & (minString != None):
        return int(hString) * 60 + int(minString[:-1])
    else:
        return None

flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h", "duration_m")))


but I got this error:

  File "<ipython-input-67-5eb2daa1c1f2>", line 6, in getMinutes
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'


Does anyone know how to do this?


Thanks,

Zeming

Re: udf that handles null values

Posted by Zeming Yu <ze...@gmail.com>.
Thank you both!

Here's the code that's working now. It's a bit hard to read because so many
function calls are nested. Any ideas on how I can improve the readability?

from pyspark.sql.functions import trim, when, from_unixtime, unix_timestamp, minute, hour

duration_test = flight2.select("stop_duration1")
duration_test.show()


duration_test.withColumn(
    'duration_h',
    when(duration_test.stop_duration1.isNull(), 999)
    .otherwise(hour(unix_timestamp(duration_test.stop_duration1, "HH'h'mm'm'").cast("timestamp")))
).show(20)


+--------------+
|stop_duration1|
+--------------+
|         0h50m|
|         3h15m|
|         8h35m|
|         1h30m|
|        12h15m|
|        11h50m|
|          2h5m|
|        10h25m|
|         8h20m|
|          null|
|         2h50m|
|         2h30m|
|         7h45m|
|         1h10m|
|         2h15m|
|          2h0m|
|        10h25m|
|         1h40m|
|         1h55m|
|         1h40m|
+--------------+
only showing top 20 rows

+--------------+----------+
|stop_duration1|duration_h|
+--------------+----------+
|         0h50m|         0|
|         3h15m|         3|
|         8h35m|         8|
|         1h30m|         1|
|        12h15m|        12|
|        11h50m|        11|
|          2h5m|         2|
|        10h25m|        10|
|         8h20m|         8|
|          null|       999|
|         2h50m|         2|
|         2h30m|         2|
|         7h45m|         7|
|         1h10m|         1|
|         2h15m|         2|
|          2h0m|         2|
|        10h25m|        10|
|         1h40m|         1|
|         1h55m|         1|
|         1h40m|         1|
+--------------+----------+
only showing top 20 rows
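
On the readability question, one option is to put the parsing in a small named function. The following is only a plain-Python sketch, not the code from this thread: the name duration_to_minutes and the regular expression are assumptions made for illustration, based on the "3h15m"-style values shown in the table above. It returns total minutes rather than just the hour, and passes null (None) through unchanged.

```python
import re

# Hypothetical pattern for values such as "3h15m" or "12h5m": hours, then minutes.
_DURATION = re.compile(r"^(\d+)h(\d+)m$")


def duration_to_minutes(duration):
    """Convert "3h15m" to 195; return None for null or unparseable input."""
    # A null column value reaches Python as None, so check it before parsing.
    if duration is None:
        return None
    match = _DURATION.match(duration.strip())
    if match is None:
        return None
    hours, minutes = match.groups()
    return int(hours) * 60 + int(minutes)


print(duration_to_minutes("0h50m"))   # 50
print(duration_to_minutes("12h15m"))  # 735
print(duration_to_minutes(None))      # None
```

A function like this could be wrapped with pyspark.sql.functions.udf and an IntegerType return type to apply it to a column. Built-in column functions such as when and unix_timestamp usually run faster than a Python UDF, so this trades some speed for readability.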





On Tue, Apr 25, 2017 at 11:29 AM, Pushkar.Gujar <pu...@gmail.com>
wrote:

> Someone had a similar issue today on Stack Overflow.
>
> http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728
>
>
> Thank you,
> *Pushkar Gujar*

Re: udf that handles null values

Posted by "Pushkar.Gujar" <pu...@gmail.com>.
Someone had a similar issue today on Stack Overflow.

http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728



Thank you,
*Pushkar Gujar*



Re: udf that handles null values

Posted by cy h <ci...@gmail.com>.
Quoting Python's coding style guide, PEP 8:

https://www.python.org/dev/peps/pep-0008/#programming-recommendations



Comparisons to singletons like None should always be done with is or is not, never the equality operators.
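
Applied to the function from the original question, that advice might look like the sketch below. The name get_minutes and the input shapes (an hours string like "3" and a minutes string like "15m") are assumptions inferred from the posted code, and this does not claim to explain the TypeError that was reported.

```python
def get_minutes(h_string, min_string):
    """Return total minutes, or None if either input is null."""
    # Spark hands SQL NULL to a Python UDF as None, so compare with `is`.
    if h_string is None or min_string is None:
        return None
    # Assumption: h_string looks like "3" and min_string looks like "15m",
    # so the trailing "m" is dropped before converting to int.
    return int(h_string) * 60 + int(min_string[:-1])


print(get_minutes("3", "15m"))   # 195
print(get_minutes(None, "15m"))  # None
```

To use it on a DataFrame, the function would still need to be wrapped with pyspark.sql.functions.udf and given a return type such as IntegerType, as with udfGetMinutes in the question.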



Cinyoung
