Posted to dev@spark.apache.org by Tim <bo...@posteo.de> on 2022/08/09 20:10:02 UTC

[DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Hi all,

this is my first message to the Spark mailing list, so please bear with 
me if I don't fully meet your communication standards.
I just wanted to discuss one aspect that I've stumbled across several 
times over the past few weeks.
When working with Spark, I often run into the problem of having to merge 
two (or more) existing StructTypes into a new one to define a schema.
Usually this looks similar (in Python) to the following simplified 
example:

         from pyspark.sql.types import IntegerType, StringType, StructField, StructType

         a = StructType([StructField("field_a", StringType())])
         b = StructType([StructField("field_b", IntegerType())])

         combined = StructType(a.fields + b.fields)

My idea, which I would like to discuss, is to shorten the above example 
in Python as follows by supporting Python's add operator for 
StructTypes:

         combined = a + b


What do you think of this idea? Are there any reasons why this is not 
yet part of StructType's functionality?
If you support this idea, I could create a first PR for further and 
deeper discussion.

Best
Tim

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Posted by Alexandros Biratsis <al...@gmail.com>.

Hi Maciej,

Sorry for the late reply. I believe you are right. Merging nested
StructTypes can be tricky. As a matter of fact, it will require complex
logic and most likely some conventions to cover all the edge cases.

What about just exposing the existing merge logic
<https://github.com/apache/spark/blob/36dd531a93af55ce5c2bfd8d275814ccb2846962/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L496>
(currently private) through a public *merge* method? Could that add some extra
flexibility to the current API?

Best,
Alex

On Sun, Aug 14, 2022 at 2:10 PM Maciej <ms...@gmail.com> wrote:

> I have mixed feelings about this proposal. Merging or diffing schemas is
> a common operation, but specific requirements differ from case to case,
> especially when complex nested data is used.
>
> Even if we put ordering of the fields aside, data type equality
> semantics (StructField in particular) are likely to result in an
> implementation which is either confusing or has limited applicability.
>
> Additionally, Scala StructType is already a Seq[StructField] and as such
> provides set-like operations (contains, diff, intersect, union) as well
> as implementations of ++ / :+ / +: so we cannot do much here, without
> breaking the existing API.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>

Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Posted by Maciej <ms...@gmail.com>.

I have mixed feelings about this proposal. Merging or diffing schemas is 
a common operation, but specific requirements differ from case to case, 
especially when complex nested data is used.

Even if we put ordering of the fields aside, data type equality
semantics (StructField in particular) are likely to result in an
implementation which is either confusing or has limited applicability.

Additionally, Scala StructType is already a Seq[StructField] and as such 
provides set-like operations (contains, diff, intersect, union) as well 
as implementations of ++ / :+ / +: so we cannot do much here, without 
breaking the existing API.

On 8/14/22 11:30, Alexandros Biratsis wrote:
> Hello Rui and Tim,
> 
> Indeed this sounds like a good idea and quite useful. To make it more formal,
> the field list of a StructType could be treated as a Scala/Python set by
> providing (inheriting?) the common set functionality, e.g. add, remove,
> concat, intersect, and diff. The set-like functionality could be part of
> the StructType class for both languages.
> 
> The Scala set collection:
> https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html
> 
> Best,
> Alex

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Posted by Alexandros Biratsis <al...@gmail.com>.

Hello Rui and Tim,

Indeed this sounds like a good idea and quite useful. To make it more formal,
the field list of a StructType could be treated as a Scala/Python set by
providing (inheriting?) the common set functionality, e.g. add, remove,
concat, intersect, and diff. The set-like functionality could be part of
the StructType class for both languages.

The Scala set collection
https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/Set.html

Best,
Alex

On Wed, Aug 10, 2022, 08:14 Rui Wang <am...@apache.org> wrote:

> Thanks for the idea!
>
> I think that "combined = StructType(a.fields + b.fields)" is still a good
> approach because:
> 1) it is not horrible to merge a and b in this way.
> 2) it makes the intention clear by itself: merge the fields of two structs
> to construct a new struct.
> 3) you also have room to apply more complicated operations when merging
> fields, for example removing duplicate fields with the same name, or using
> a.fields but removing some fields if they are in b.
>
> Overloading "+" could be problematic because:
> 1. it is ambiguous what this plus is doing.
> 2. if you define + as a concatenation of the fields, then it is limited to
> concatenation only. What about other operations, like extracting fields
> from a based on b? Maybe overload "-"? In that case the list of operators
> will grow.
>
> -Rui

Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

Posted by Rui Wang <am...@apache.org>.

Thanks for the idea!

I think that "combined = StructType(a.fields + b.fields)" is still a good
approach because:
1) it is not horrible to merge a and b in this way.
2) it makes the intention clear by itself: merge the fields of two structs
to construct a new struct.
3) you also have room to apply more complicated operations when merging
fields, for example removing duplicate fields with the same name, or using
a.fields but removing some fields if they are in b.

Overloading "+" could be problematic because:
1. it is ambiguous what this plus is doing.
2. if you define + as a concatenation of the fields, then it is limited to
concatenation only. What about other operations, like extracting fields
from a based on b? Maybe overload "-"? In that case the list of operators
will grow.

-Rui

On Tue, Aug 9, 2022 at 1:10 PM Tim <bo...@posteo.de> wrote:

> Hi all,
>
> this is my first message to the Spark mailing list, so please bear with
> me if I don't fully meet your communication standards.
> I just wanted to discuss one aspect that I've stumbled across several
> times over the past few weeks.
> When working with Spark, I often run into the problem of having to merge
> two (or more) existing StructTypes into a new one to define a schema.
> Usually this looks similar (in Python) to the following simplified
> example:
>
>          from pyspark.sql.types import IntegerType, StringType, StructField, StructType
>
>          a = StructType([StructField("field_a", StringType())])
>          b = StructType([StructField("field_b", IntegerType())])
>
>          combined = StructType(a.fields + b.fields)
>
> My idea, which I would like to discuss, is to shorten the above example
> in Python as follows by supporting Python's add operator for
> StructTypes:
>
>          combined = a + b
>
>
> What do you think of this idea? Are there any reasons why this is not
> yet part of StructType's functionality?
> If you support this idea, I could create a first PR for further and
> deeper discussion.
>
> Best
> Tim