Posted to dev@spark.apache.org by Pablo Alcain <pa...@wildlifestudios.com> on 2021/12/29 11:54:27 UTC

Re: PySpark Dynamic DataFrame for easier inheritance

Hey everyone! I'm re-sending this e-mail, now with a PR proposal (
https://github.com/apache/spark/pull/35045 if you want to take a look at
the code with a couple of examples). The proposed change adds only a new
class that extends the Python API, without any change to the underlying
Scala code. The benefit is that the new code only extends existing
functionality without breaking any existing application code, allowing
pyspark users to try it out and see if it turns out to be useful. Hyukjin
Kwon <https://github.com/HyukjinKwon> commented that a drawback of this
approach is that it would be hard to deprecate the `DynamicDataFrame` API
later. The other option, if we want this inheritance to be feasible, is to
implement this "casting" directly in the `DataFrame` code, so for example
it would change from

def limit(self, num: int) -> "DataFrame":
    jdf = self._jdf.limit(num)
    return DataFrame(jdf, self.sql_ctx)

to

def limit(self, num: int) -> "DataFrame":
    jdf = self._jdf.limit(num)
    return self.__class__(jdf, self.sql_ctx)  # type(self) would work as well
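
For illustration, a minimal sketch of what this change would enable (the
subclass and its method below are hypothetical, and the snippet assumes the
modified `limit` above is in place):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class MyBusinessDataFrame(DataFrame):
    # hypothetical business method; with the change above, limit(), select(),
    # etc. would return MyBusinessDataFrame, so chaining keeps the subclass
    def with_double_price(self) -> "MyBusinessDataFrame":
        return self.withColumn("price", F.col("price") * 2)

# e.g.: MyBusinessDataFrame(df._jdf, df.sql_ctx).limit(10).with_double_price()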

This approach would probably require similar changes on the Scala API in
order to allow this kind of inheritance in Scala as well (unfortunately I'm
not knowledgeable enough in Scala to figure out exactly what the changes
would be).

I wanted to gather your input on this idea, whether you think it can be
helpful or not, and what would be the best strategy, in your opinion, to
pursue it.

Thank you very much!
Pablo

On Thu, Nov 4, 2021 at 9:44 PM Pablo Alcain <
pablo.alcain@wildlifestudios.com> wrote:

> tl;dr: a proposal for a pyspark "DynamicDataFrame" class that would make
> it easier to inherit from it while keeping dataframe methods.
>
> Hello everyone. We have been working for a long time with PySpark and more
> specifically with DataFrames. In our pipelines we have several tables, with
> specific purposes, that we usually load as DataFrames. As you might expect,
> there are a handful of queries and transformations per dataframe that are
> done many times, so we thought of ways that we could abstract them:
>
> 1. Functions: using functions that take dataframes and return them
> transformed. It had a couple of pitfalls: we had to manage the namespaces
> carefully, and also the "chainability" didn't feel very pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched (
> https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
> methods with the regularly used queries inside the DataFrame class. This
> one kept it pyspark-y, but there was no easy way to handle segregated
> namespaces.
> 3. Inheritance: create the class `MyBusinessDataFrame`, inherit from
> `DataFrame` and implement the methods there. This one solves all the
> issues, but with a caveat: the chainable methods cast the result explicitly
> to `DataFrame` (see
> https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910
> e.g.). Therefore, every time you use one of the parent's methods you'd have
> to re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>
> In view of these pitfalls we decided to go for a slightly different
> approach, inspired by #3: we created a class called `DynamicDataFrame` that
> overrides the explicit call to `DataFrame` as done in PySpark and instead
> casts dynamically to `self.__class__` (see
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21
> e.g.). This allows the fluent methods to always keep the same class, making
> chainability as smooth as it is with pyspark dataframes.
>
> As an example implementation, here's a link to a gist (
> https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e)
> that dynamically implements the `withColumn` and `select` methods and shows
> the expected output.
>
> I'm sharing this here in case you feel like this approach can be useful
> for anyone else. In our case it greatly sped up the development of
> abstraction layers and allowed us to write cleaner code. One of the
> advantages is that it would simply be a "plugin" over pyspark, that does
> not modify anyhow already existing code or application interfaces.
>
> If you think that this can be helpful, I can write a PR as a more refined
> proof of concept.
>
> Thanks!
>
> Pablo
>

Re: PySpark Dynamic DataFrame for easier inheritance

Posted by Takuya Ueshin <ue...@databricks.com>.
I'm afraid I'm also against the proposal so far.

What's wrong with going with "1. Functions" and using `transform`, which
allows chaining functions?
I was not sure what you meant by "manage the namespaces", though.


def with_price(df, factor: float = 2.0):
    return df.withColumn("price", F.col("price") * factor)

df.transform(with_price).show()


I have to admit that the current transform is a bit annoying when the
function takes parameters:


df.transform(lambda input_df: with_price(input_df, 100)).show()


but we can improve transform to take the parameters for the function.
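
In the meantime, the parameters can be bound up front, e.g. with
functools.partial (a sketch, reusing the with_price helper above):

from functools import partial

df.transform(partial(with_price, factor=100)).show()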


Or, I'd also recommend using a wrapper as Maciej suggested, but without
delegating all functions.
I'd expose only the functions that are really necessary; otherwise,
managing the dataframe would become rather difficult.

For example, with a MyBusinessDataFrame
<https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L28-L30>,


base_dataframe = spark.createDataFrame(
        data=[['product_1', 2], ['product_2', 4]],
        schema=["name", "price"],
        )
dyn_business = MyBusinessDataFrame(base_dataframe)
dyn_business.select("name").my_business_query(2.0)


will raise an AnalysisException because the price column no longer exists
after the select.
We should manage the dataframe properly in the wrapper.
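
As a rough sketch of what "managing it properly" could mean, the wrapper can
expose only what the business layer needs (everything here other than
my_business_query is illustrative):

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class MyBusiness:
    def __init__(self, df: DataFrame):
        self._df = df

    def my_business_query(self, factor: float) -> "MyBusiness":
        # price is assumed to be a column of the wrapped dataframe
        return MyBusiness(self._df.withColumn("price", F.col("price") * factor))

    def show(self) -> None:
        self._df.show()

Since select is not exposed, the price column cannot be dropped before
my_business_query is called.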


Thanks.


On Wed, Dec 29, 2021 at 8:49 AM Maciej <ms...@gmail.com> wrote:

> [...]

Re: PySpark Dynamic DataFrame for easier inheritance

Posted by Maciej <ms...@gmail.com>.
On 12/29/21 16:18, Pablo Alcain wrote:
> Hey Maciej! Thanks for your answer and the comments :) 
> 
> On Wed, Dec 29, 2021 at 3:06 PM Maciej <mszymkiewicz@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     This seems like a lot of trouble for a not-so-common use case that has
>     viable alternatives. Once you assume that the class is intended for
>     inheritance (which, arguably, we neither do nor imply at the moment)
>     you're even more restricted than we are right now, given the project
>     policy and the need to keep things synchronized across all languages.
> 
> By "this" you mean the modification of the DataFrame, the implementation
> of a new pyspark class (DynamicDataFrame in this case) or the approach
> in general?

I mean promoting DataFrame as extensible in general. It risks getting us
stuck with a specific API, even more than we are right now, with little
reward at the end.

Additionally:

- As far as I am aware, nothing suggests that this is a widely requested
feature (the corresponding SO questions didn't get much traffic over the
years, and I don't think we have any preceding JIRA tickets).
- It can be addressed outside the project (within user codebase or as a
standalone package) with minimal or no overhead.

That being said ‒ if we're going to rewrite the Python DataFrame methods to
return the instance type, I strongly believe that the existing methods
should be marked as final.

>  
> 
> 
>     On the Scala side, I would rather expect to see type classes than direct
>     inheritance, so this might be a dead feature from the start.
> 
>     As for Python (sorry if I missed something in the preceding discussion),
>     a quite natural approach would be to wrap a DataFrame instance in your
>     business class and delegate calls to the wrapped object. A very naive
>     implementation could look like this:
> 
>     from functools import wraps
> 
>     class BusinessModel:
>         @classmethod
>         def delegate(cls, a):
>             def _(*args, **kwargs):
>                 result = a(*args, **kwargs)
>                 if isinstance(result, DataFrame):
>                     return  cls(result)
>                 else:
>                     return result
> 
>             if callable(a):
>                 return wraps(a)(_)
>             else:
>                 return a
> 
>         def __init__(self, df):
>             self._df = df
> 
>         def __getattr__(self, name):
>             return BusinessModel.delegate(getattr(self._df, name))
> 
>         def with_price(self, price=42):
>             return self.selectExpr("*", f"{price} as price")
> 
> 
> 
> Yes, effectively the solution is very similar to this one. I believe
> that the advantage of doing it without hijacking the delegation with a
> decorator is that you can still maintain static typing.

You can maintain type checker compatibility (it is easier with stubs,
but you can do it with inline hints as well, if I recall correctly) here
as well.

> On the other
> hand (and this is probably a minor issue), when following this approach
> with the `isinstance` check for the casting you might end up casting
> the `.summary()` and `.describe()` methods that you probably still want
> to keep as "pure" DataFrames. If you see it from this perspective, then
> "DynamicDataFrame" would be the boilerplate code that allows you to
> decide more granularly which methods you want to delegate.

You can do it with `__getattr__` as well. There are probably some edge
cases (especially when accessing columns with `.`), but it should still be
manageable.
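
For instance, a rough sketch of such granular delegation (a variation on the
BusinessModel above; the allowlist is illustrative):

class BusinessModel:
    # only these methods get re-wrapped into the business class; everything
    # else (summary(), describe(), columns accessed with ".", ...) is
    # returned as-is
    _DELEGATED = {"select", "selectExpr", "withColumn", "filter", "limit"}

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if name in BusinessModel._DELEGATED and callable(attr):
            return lambda *args, **kwargs: type(self)(attr(*args, **kwargs))
        return attr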


Just to be clear ‒ I am not insisting that this is somehow a superior
solution (there are things that cannot be done through delegation).

> 
>     (BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
>         .select("id")
>         .with_price(0.0)
>         .select("price")
>         .show())
> 
> 
>     but it can be easily adjusted to handle more complex use cases,
>     including inheritance.
> 
> 
> 
>     On 12/29/21 12:54, Pablo Alcain wrote:
>     > [...]
> 
> 
>     -- 
>     Best regards,
>     Maciej Szymkiewicz
> 
>     Web: https://zero323.net <https://zero323.net>
>     PGP: A30CEF0C31A501EC
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

Re: PySpark Dynamic DataFrame for easier inheritance

Posted by Pablo Alcain <pa...@wildlifestudios.com>.
Hey Maciej! Thanks for your answer and the comments :)

On Wed, Dec 29, 2021 at 3:06 PM Maciej <ms...@gmail.com> wrote:

> This seems like a lot of trouble for a not-so-common use case that has
> viable alternatives. Once you assume that the class is intended for
> inheritance (which, arguably, we neither do nor imply at the moment) you're
> even more restricted than we are right now, given the project policy and
> the need to keep things synchronized across all languages.
>
By "this" you mean the modification of the DataFrame, the implementation of
a new pyspark class (DynamicDataFrame in this case) or the approach in
general?


>
> On the Scala side, I would rather expect to see type classes than direct
> inheritance, so this might be a dead feature from the start.
>
> As for Python (sorry if I missed something in the preceding discussion),
> a quite natural approach would be to wrap a DataFrame instance in your
> business class and delegate calls to the wrapped object. A very naive
> implementation could look like this:
>
> from functools import wraps
>
> class BusinessModel:
>     @classmethod
>     def delegate(cls, a):
>         def _(*args, **kwargs):
>             result = a(*args, **kwargs)
>             if isinstance(result, DataFrame):
>                 return  cls(result)
>             else:
>                 return result
>
>         if callable(a):
>             return wraps(a)(_)
>         else:
>             return a
>
>     def __init__(self, df):
>         self._df = df
>
>     def __getattr__(self, name):
>         return BusinessModel.delegate(getattr(self._df, name))
>
>     def with_price(self, price=42):
>         return self.selectExpr("*", f"{price} as price")
>
>
>
Yes, effectively the solution is very similar to this one. I believe that
the advantage of doing it without hijacking the delegation with a decorator
is that you can still maintain static typing. On the other hand
(and this is probably a minor issue), when following this approach with the
`isinstance` check for the casting you might end up casting the
`.summary()` and `.describe()` methods that you probably still want to keep
as "pure" DataFrames. If you see it from this perspective, then
"DynamicDataFrame" would be the boilerplate code that allows you to decide
more granularly which methods you want to delegate.


> (BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
>     .select("id")
>     .with_price(0.0)
>     .select("price")
>     .show())
>
>
> but it can be easily adjusted to handle more complex use cases,
> including inheritance.
>
>
>
> On 12/29/21 12:54, Pablo Alcain wrote:
> > [...]
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>

Re: PySpark Dynamic DataFrame for easier inheritance

Posted by Maciej <ms...@gmail.com>.
This seems like a lot of trouble for a not-so-common use case that has
viable alternatives. Once you assume that the class is intended for
inheritance (which, arguably, we neither do nor imply at the moment) you're
even more restricted than we are right now, given the project policy and
the need to keep things synchronized across all languages.

On the Scala side, I would rather expect to see type classes than direct
inheritance, so this might be a dead feature from the start.

As for Python (sorry if I missed something in the preceding discussion),
a quite natural approach would be to wrap a DataFrame instance in your
business class and delegate calls to the wrapped object. A very naive
implementation could look like this:

from functools import wraps

class BusinessModel:
    @classmethod
    def delegate(cls, a):
        def _(*args, **kwargs):
            result = a(*args, **kwargs)
            if isinstance(result, DataFrame):
                return cls(result)
            else:
                return result

        if callable(a):
            return wraps(a)(_)
        else:
            return a

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        return BusinessModel.delegate(getattr(self._df, name))

    def with_price(self, price=42):
        return self.selectExpr("*", f"{price} as price")


(BusinessModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .select("id")
    .with_price(0.0)
    .select("price")
    .show())


but it can be easily adjusted to handle more complex use cases,
including inheritance.
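
For example, one such adjustment (a sketch; DiscountedModel and its method
are hypothetical) is to delegate through type(self) instead of
BusinessModel, so that subclasses keep their own type through chained calls:

from functools import wraps

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

class BusinessModel:
    @classmethod
    def delegate(cls, a):
        def _(*args, **kwargs):
            result = a(*args, **kwargs)
            # re-wrap DataFrame results in whatever class we were called on
            return cls(result) if isinstance(result, DataFrame) else result

        return wraps(a)(_) if callable(a) else a

    def __init__(self, df):
        self._df = df

    def __getattr__(self, name):
        # dispatch on type(self) so that subclasses survive delegation
        return type(self).delegate(getattr(self._df, name))

    def with_price(self, price=42):
        return self.selectExpr("*", f"{price} as price")

class DiscountedModel(BusinessModel):
    def with_discount(self, pct=0.1):
        # still a DiscountedModel afterwards, thanks to type(self) above
        return self.withColumn("price", F.col("price") * (1 - pct))

(DiscountedModel(spark.createDataFrame([(1, "DEC")], ("id", "month")))
    .with_price(100.0)
    .with_discount(0.25)
    .select("price")
    .show())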



On 12/29/21 12:54, Pablo Alcain wrote:
> [...]


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC