Posted to dev@spark.apache.org by Olivier Girardot <o....@lateral-thoughts.com> on 2015/04/20 11:17:05 UTC

Dataframe.fillna from 1.3.0

Hi everyone,
let's assume I'm stuck on 1.3.0: how can I benefit from the *fillna* API in
PySpark? Is there any efficient alternative to mapping the records myself?

Regards,

Olivier.

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
The changes look good to me. Jenkins is somehow not responding. Will merge
once Jenkins comes back happy.


Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
done: https://github.com/apache/spark/pull/5683 and
https://issues.apache.org/jira/browse/SPARK-7118
thx

Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
I'll try, thanks.

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
You can do it the same way countDistinct is done, can't you?

https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78
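
For reference, a rough sketch of what such a wrapper in
pyspark/sql/functions.py could look like, modelled on the countDistinct
helper linked above. The import paths and the _to_java_column/_to_seq
helpers are shown as in later releases and are assumptions here, not the
final implementation:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def coalesce(*cols):
    # Pack the Python Columns into a Scala Seq so that the JVM-side
    # varargs function org.apache.spark.sql.functions.coalesce accepts them.
    sc = SparkContext._active_spark_context
    jc = sc._jvm.org.apache.spark.sql.functions.coalesce(
        _to_seq(sc, cols, _to_java_column))
    return Column(jc)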



Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
I found another way: setting SPARK_HOME to a released version and
launching IPython to load the contexts.
I may need your insight, however. I found why it wasn't done at the same
time: this method (like some others) takes varargs in Scala, and for now
the way functions are called only supports a single parameter.

So at first I tried to just generalise the helper function "_" in
functions.py to take multiple arguments, but py4j's handling of varargs
forces me to create an Array[Column] when the target method expects
varargs.

But from Python's perspective, we have no idea whether the target method
expects varargs or just multiple arguments (to un-tuple).
I could special-case "coalesce", or "methods that take a list of columns
as arguments", considering they will be varargs-based (and therefore need
an Array[Column] instead of just a list of arguments).

But this seems very specific and very prone to future mistakes.
Is there any way in Py4j to know the signature of a method before calling
it?
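
For concreteness, a minimal sketch of the Array[Column] construction that
the varargs path forces on the Python side (a hypothetical helper; it
assumes cols is a list of pyspark Column objects, and uses py4j's
new_array to build a Java array):

def _to_java_column_array(sc, cols):
    # Pack pyspark Columns into a Java Column[] so that py4j can bind
    # them to a varargs parameter on the JVM side.
    arr = sc._gateway.new_array(sc._jvm.org.apache.spark.sql.Column,
                                len(cols))
    for i, col in enumerate(cols):
        arr[i] = col._jc  # unwrap the JVM Column held by the wrapper
    return arr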


Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
You need to first have the Spark assembly jar built with "sbt/sbt
assembly/assembly"

Then usually I go into python/run-tests and comment out the non-SQL tests:

#run_core_tests
run_sql_tests
#run_mllib_tests
#run_ml_tests
#run_streaming_tests

And then you can run "python/run-tests"




Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
What is the way of testing/building the pyspark part of Spark?

Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
yep :) I'll open the jira when I've got the time.
Thanks

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
Ah damn. We need to add it to the Python list. Would you like to give it a
shot?


Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
Yep, no problem, but I can't seem to find the coalesce function in
pyspark.sql.{*, functions, types or whatever :) }

Olivier.

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
It is actually different.

The coalesce expression picks the first value that is not null:
https://msdn.microsoft.com/en-us/library/ms190349.aspx

It would be great to update the documentation for it (both Scala and Java)
to explain that it is different from the coalesce method on a DataFrame/RDD,
which changes the number of partitions. Do you want to submit a pull request?
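
To make the distinction concrete, a quick PySpark-flavoured sketch (the
expression form assumes a coalesce wrapper exposed in the DSL, which is
what SPARK-7118 adds; the partition form already exists on RDDs):

df.select(coalesce(df["a"], df["b"], lit(0.0)))  # first non-null value per row
df.rdd.coalesce(2)  # merge the data into 2 partitions, rows unchanged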



Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
I think I found the Coalesce you were talking about, but it is a Catalyst
class that I don't think is available from pyspark.

Regards,

Olivier.

Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
Where should this *coalesce* come from? Is it related to the
partition-manipulation coalesce method?
Thanks!

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
Ah, I see. You can do something like


df.select(coalesce(df("a"), lit(0.0)))

Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
From PySpark it seems to me that fillna relies on Java/Scala code;
that's why I was wondering.
Thank you for answering :)

Re: Dataframe.fillna from 1.3.0

Posted by Reynold Xin <rx...@databricks.com>.
You can just create a fillna function based on the 1.3.1 implementation of
fillna, no?
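
For reference, the 1.3.1 implementation being referred to is essentially a
thin delegation to the JVM (a simplified sketch, not the verbatim source;
the real method also accepts strings, dicts and a subset argument):

# Roughly the shape of DataFrame.fillna in pyspark/sql/dataframe.py (1.3.1):
def fillna(self, value):
    # All the work happens on the JVM side via DataFrameNaFunctions, so a
    # pure-Python copy only helps if the matching na() entry point exists
    # in the Spark build you are running against.
    return DataFrame(self._jdf.na().fill(float(value)), self.sql_ctx)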


Re: Dataframe.fillna from 1.3.0

Posted by Olivier Girardot <o....@lateral-thoughts.com>.
A UDF might be a good idea, no?
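
For what it's worth, a minimal UDF-based sketch for 1.3.0 (it assumes a
DataFrame df with a nullable numeric column "a"; pyspark.sql.functions.udf
is available in 1.3.0):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

fill_zero = udf(lambda v: 0.0 if v is None else v, DoubleType())
df_filled = df.select(fill_zero(df["a"]).alias("a"))

The trade-off is that every value round-trips through Python, whereas a
JVM-side expression such as coalesce keeps the work on the Scala side.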
