Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/04/25 00:22:49 UTC
udf that handles null values
hi all,
I tried to write a UDF that handles null values:
def getMinutes(hString, minString):
    if (hString != None) & (minString != None):
        return int(hString) * 60 + int(minString[:-1])
    else:
        return None
flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h", "duration_m")))
but I got this error:
File "<ipython-input-67-5eb2daa1c1f2>", line 6, in getMinutes
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Does anyone know how to do this?
Thanks,
Zeming
Re: udf that handles null values
Posted by Zeming Yu <ze...@gmail.com>.
Thank you both!
Here's the code that's working now. It's a bit hard to read due to so many
functions. Any idea how I can improve the readability?
from pyspark.sql.functions import trim, when, from_unixtime, unix_timestamp, minute, hour
duration_test = flight2.select("stop_duration1")
duration_test.show()
duration_test.withColumn(
    'duration_h',
    when(duration_test.stop_duration1.isNull(), 999)
    .otherwise(hour(unix_timestamp(duration_test.stop_duration1, "HH'h'mm'm'").cast("timestamp")))
).show(20)
+--------------+
|stop_duration1|
+--------------+
|         0h50m|
|         3h15m|
|         8h35m|
|         1h30m|
|        12h15m|
|        11h50m|
|          2h5m|
|        10h25m|
|         8h20m|
|          null|
|         2h50m|
|         2h30m|
|         7h45m|
|         1h10m|
|         2h15m|
|          2h0m|
|        10h25m|
|         1h40m|
|         1h55m|
|         1h40m|
+--------------+
only showing top 20 rows
+--------------+----------+
|stop_duration1|duration_h|
+--------------+----------+
|         0h50m|         0|
|         3h15m|         3|
|         8h35m|         8|
|         1h30m|         1|
|        12h15m|        12|
|        11h50m|        11|
|          2h5m|         2|
|        10h25m|        10|
|         8h20m|         8|
|          null|       999|
|         2h50m|         2|
|         2h30m|         2|
|         7h45m|         7|
|         1h10m|         1|
|         2h15m|         2|
|          2h0m|         2|
|        10h25m|        10|
|         1h40m|         1|
|         1h55m|         1|
|         1h40m|         1|
+--------------+----------+
only showing top 20 rows
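One possible way to approach the readability question, sketched in plain Python under stated assumptions: the helper name duration_to_minutes and the regular expression are hypothetical (not from the thread), and the "XhYm" format is inferred from the sample rows above. The helper returns None for null or malformed input, so it could be wrapped with pyspark.sql.functions.udf if a UDF is preferred. Built-in column functions such as the when/otherwise chain above usually run faster than Python UDFs, so this trades some speed for a named, testable function.

```python
import re

# Assumed format, inferred from the sample rows: "<hours>h<minutes>m", e.g. "3h15m"
_DURATION = re.compile(r"^(\d+)h(\d+)m$")


def duration_to_minutes(duration):
    """Return total minutes for an 'XhYm' string, or None if missing or malformed."""
    # Nulls in a Spark column arrive in a Python UDF as None
    if duration is None:
        return None
    match = _DURATION.match(duration.strip())
    if match is None:
        return None
    hours, minutes = match.groups()
    return int(hours) * 60 + int(minutes)


for value in ["0h50m", "12h15m", None]:
    print(value, duration_to_minutes(value))
```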
On Tue, Apr 25, 2017 at 11:29 AM, Pushkar.Gujar <pu...@gmail.com>
wrote:
> Someone had a similar issue today on Stack Overflow.
>
> http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728
>
>
> Thank you,
> *Pushkar Gujar*
Re: udf that handles null values
Posted by "Pushkar.Gujar" <pu...@gmail.com>.
Someone had a similar issue today on Stack Overflow.
http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728
Thank you,
*Pushkar Gujar*
Re: udf that handles null values
Posted by cy h <ci...@gmail.com>.
Quoting Python's coding style guidelines, PEP 8:
https://www.python.org/dev/peps/pep-0008/#programming-recommendations
"Comparisons to singletons like None should always be done with is or is not, never the equality operators."
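Applied to the function from the original post, that advice might look like the sketch below. The name get_minutes is a hypothetical renaming, and the input shapes (an hour string such as "3" and a minute string with a one-character suffix such as "15m") are assumptions read off the original code. This only illustrates the identity check and is not a confirmed fix for the reported error.

```python
def get_minutes(h_string, min_string):
    """Return total minutes, or None if either input is missing."""
    # Identity comparison with None, as PEP 8 recommends, and boolean `or`
    # instead of the bitwise `&` operator
    if h_string is None or min_string is None:
        return None
    # Assumes min_string ends with a one-character unit suffix, e.g. "15m"
    return int(h_string) * 60 + int(min_string[:-1])


print(get_minutes("3", "15m"))
print(get_minutes(None, "15m"))
```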
Cinyoung