Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2018/10/26 16:15:35 UTC

Helper methods for PySpark discussion

Coming out of https://github.com/apache/spark/pull/21654, it was agreed that
the helper methods in question made sense, but there was some desire for a
plan as to which helper methods we should add.

I'd like to propose a lightweight solution to start with for helper
methods that match either Pandas or general Python collection helper
methods (a rough sketch of each category follows the list):
1) If the helper method doesn't collect the DataFrame back to the driver
or force evaluation, then we should add it without discussion
2) If the method forces evaluation and that matches the most obvious way
it would be implemented, then we should add it with a note in the docstring
3) If the method does collect the DataFrame back to the driver and that is
the most obvious way it would be implemented (e.g. calling list() to get
back a list would have to collect the DataFrame), then we should add it
with a warning in the docstring
4) If the method collects the DataFrame but a reasonable Python developer
wouldn't expect that behaviour, then not implementing the helper method
would be better

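To make the categories concrete, here's a rough sketch of each bucket
(helper names and docstring wording are illustrative only, not proposed
API):

    # 1) No collect, no forced evaluation: add without discussion.
    def columns(df):
        """Return the column names (metadata only; no job is run)."""
        return df.columns

    # 2) Forces evaluation, and that's the obvious implementation:
    #    add it, with a note in the docstring.
    def length(df):
        """Return the number of rows.

        Note: this forces evaluation of the DataFrame (runs a Spark job).
        """
        return df.count()

    # 3) Collects to the driver, and that's the obvious implementation:
    #    add it, with a warning in the docstring.
    def to_list(df):
        """Return all rows as a local list.

        Warning: this collects the entire DataFrame to the driver and can
        exhaust driver memory on large inputs.
        """
        return df.collect()

    # 4) Collects, but a reasonable Python developer wouldn't expect it:
    #    don't implement the helper at all.
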
What do folks think?
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Helper methods for PySpark discussion

Posted by Reynold Xin <rx...@databricks.com>.
I agree - it is very easy for users to shoot themselves in the foot if we
don't put in safeguards, or if we mislead them into thinking these
operations are cheap. A DataFrame in Spark isn't like a single-node
in-memory data structure.

Note that the repr string work is very different: it is off by default,
requires opt-in, and is designed for a specific use case, as explained in
my original email that proposed adding it.
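
For reference, opting in looks like this (assuming the conf name added by
SPARK-24215 for Spark 2.4):

    from pyspark.sql import SparkSession

    # Eager repr of DataFrames in notebooks/REPLs is off by default; it
    # must be enabled explicitly via a SQL conf:
    spark = (SparkSession.builder
             .config("spark.sql.repl.eagerEval.enabled", "true")
             .getOrCreate())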



On Sat, Oct 27, 2018 at 8:40 AM Leif Walsh <le...@gmail.com> wrote:

> In the case of len, I think we should examine how Python does iterators
> and generators. https://docs.python.org/3/library/collections.abc.html
>
> Iterators have __iter__ and __next__ but are not Sized, so they don't have
> __len__. If you ask for the len() of a generator (like len(x for x in
> range(10) if x % 2 == 0)), you get a reasonable error message and might
> respond by calling len(list(g)) if you know you can afford to materialize
> g’s contents. Of course, with a DataFrame, materializing all the columns
> for all rows back on the Python side is way more expensive than
> df.count(), so we never want to steer people toward len(list(df)), but
> len(df) having an expensive side effect would definitely surprise me.
>
> Perhaps we can consider the abstract base classes that DataFrames and RDDs
> should implement. I actually think it’s not many of them: we don’t have
> random access, or sizes, or even a cheap way to do set membership.
>
> For the case of len(), I think the best option is to show an error message
> that tells you to call count() instead.

Re: Helper methods for PySpark discussion

Posted by Leif Walsh <le...@gmail.com>.
In the case of len, I think we should examine how Python does iterators and
generators. https://docs.python.org/3/library/collections.abc.html

Iterators have __iter__ and __next__ but are not Sized, so they don’t have
__len__. If you ask for the len() of a generator (like len(x for x in
range(10) if x % 2 == 0)), you get a reasonable error message and might
respond by calling len(list(g)) if you know you can afford to materialize
g’s contents. Of course, with a DataFrame, materializing all the columns for
all rows back on the Python side is way more expensive than df.count(), so
we never want to steer people toward len(list(df)), but len(df) having an
expensive side effect would definitely surprise me.

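To see the shape of that, a quick sketch in plain CPython (nothing
Spark-specific):

    from collections.abc import Iterator, Sized

    g = (x for x in range(10) if x % 2 == 0)
    isinstance(g, Iterator)  # True: generators have __iter__ and __next__
    isinstance(g, Sized)     # False: generators don't define __len__
    len(g)  # TypeError: object of type 'generator' has no len()
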
Perhaps we can consider the abstract base classes that DataFrames and RDDs
should implement. I actually think it’s not many of them: we don’t have
random access, or sizes, or even a cheap way to do set membership.

For the case of len(), I think the best option is to show an error message
that tells you to call count() instead.
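
Something like this (hypothetical sketch of a method on DataFrame, not
current PySpark behaviour):

    # Fail fast and point at the explicit, obviously-distributed alternative.
    def __len__(self):
        raise TypeError(
            "len() is not supported on a DataFrame because it would run a "
            "distributed job; call df.count() explicitly instead."
        )
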
On Fri, Oct 26, 2018 at 21:06 Holden Karau <ho...@pigscanfly.ca> wrote:

> OK, so let's say you made a Spark DataFrame and you call len() on it --
> what do you expect to happen?
>
> Personally, I expect Spark to evaluate the DataFrame; this is what happens
> with collections and even iterables.
>
> The interplay with cache is a bit strange, but presumably if you've marked
> your DataFrame for caching you want to cache it (we don't automatically
> mark DataFrames for caching outside of some cases inside ML pipelines
> where this would not apply).
-- 
Cheers,
Leif

Re: Helper methods for PySpark discussion

Posted by Holden Karau <ho...@pigscanfly.ca>.
OK, so let's say you made a Spark DataFrame and you call len() on it --
what do you expect to happen?

Personally, I expect Spark to evaluate the DataFrame; this is what happens
with collections and even iterables.

The interplay with cache is a bit strange, but presumably if you've marked
your DataFrame for caching you want to cache it (we don't automatically
mark DataFrames for caching outside of some cases inside ML pipelines where
this would not apply).

On Fri, Oct 26, 2018, 10:56 AM Li Jin <ice.xelloss@gmail.com> wrote:

> > (2) If the method forces evaluation and that matches the most obvious
> > way it would be implemented, then we should add it with a note in the
> > docstring
>
> I am not sure about this, because forcing evaluation can have side
> effects. For example, df.count() can realize a cache, and if we implement
> __len__ to call df.count(), then len(df) would end up populating that
> cache, which can be unintuitive.

Re: Helper methods for PySpark discussion

Posted by Li Jin <ic...@gmail.com>.
> (2) If the method forces evaluation and that matches the most obvious
> way it would be implemented, then we should add it with a note in the
> docstring

I am not sure about this, because forcing evaluation can have side effects.
For example, df.count() can realize a cache, and if we implement __len__ to
call df.count(), then len(df) would end up populating that cache, which can
be unintuitive.
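
To illustrate (sketch; assumes a hypothetical __len__ wired to df.count(),
with spark an active SparkSession):

    df = spark.range(10**9).cache()  # marked for caching, nothing run yet

    # If __len__ delegated to df.count(), this innocuous-looking call would
    # run a full job *and* populate the cache as a side effect:
    n = len(df)  # behaves like df.count()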

On Fri, Oct 26, 2018 at 1:21 PM Leif Walsh <le...@gmail.com> wrote:

> That all sounds reasonable, but in cases 4 and maybe also 3 I would
> rather see it implemented to raise an error whose message explains what’s
> going on and suggests the closest equivalent explicit operation. And
> perhaps raise a warning (using the warnings module) for things that might
> be unintuitively expensive.

Re: Helper methods for PySpark discussion

Posted by Leif Walsh <le...@gmail.com>.
That all sounds reasonable, but in cases 4 and maybe also 3 I would rather
see it implemented to raise an error whose message explains what’s going on
and suggests the closest equivalent explicit operation. And perhaps raise a
warning (using the warnings module) for things that might be unintuitively
expensive.
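
For example (sketch only; to_list is a hypothetical helper, not proposed
API):

    import warnings

    def to_list(df):
        """Collect all rows of df into a local Python list."""
        warnings.warn(
            "to_list() collects the entire DataFrame to the driver, which "
            "can be very expensive; consider df.take(n) if you only need "
            "a few rows.",
            UserWarning,
            stacklevel=2,
        )
        return df.collect()
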
On Fri, Oct 26, 2018 at 12:15 Holden Karau <ho...@pigscanfly.ca> wrote:

> Coming out of https://github.com/apache/spark/pull/21654, it was agreed
> that the helper methods in question made sense, but there was some desire
> for a plan as to which helper methods we should add.
>
> I'd like to propose a lightweight solution to start with for helper
> methods that match either Pandas or general Python collection helper
> methods:
> 1) If the helper method doesn't collect the DataFrame back to the driver
> or force evaluation, then we should add it without discussion
> 2) If the method forces evaluation and that matches the most obvious way
> it would be implemented, then we should add it with a note in the
> docstring
> 3) If the method does collect the DataFrame back to the driver and that is
> the most obvious way it would be implemented (e.g. calling list() to get
> back a list would have to collect the DataFrame), then we should add it
> with a warning in the docstring
> 4) If the method collects the DataFrame but a reasonable Python developer
> wouldn't expect that behaviour, then not implementing the helper method
> would be better
>
> What do folks think?
-- 
-- 
Cheers,
Leif