You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@impala.apache.org by Gabor Kaszab <ga...@cloudera.com> on 2021/09/10 18:03:45 UTC

UNNEST() operation for arrays

Hey,

I'm working on implementing a new UNNEST() operation that does what it
says, unnests arrays. But here the trick would come when there are multiple
UNNESTS() in the select list. Let me explain the desired behaviour with an
example:

Let's assume the following table:
[image: Screenshot 2021-09-10 at 19.58.53.png]
An the following query:
SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;

Here the expected result is to not produce a join on the two arrays as we
would with the original way of giving them in the select lit, but to put
the values of the arrays "next to each other" like this:
[image: Screenshot 2021-09-10 at 19.59.04.png]
I have investigated some options how this could be implemented and created
a doc for my notes. Let me share this with the community so that people can
share their preference. Any feedback is welcome!

https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing

Cheers,
Gabor

Re: Re: UNNEST() operation for arrays

Posted by Tim Armstrong <ti...@gmail.com>.

Thanks for responding in detail, I agree with your points and it's
interesting that the SQL standard has something similar - I didn't know
that.

> I checked in Postgres and apparently you can unnest arrays from multiple
> tables and they still get zipped. I think for the first iteration this
> would be overkill to implement as Impala puts the arrays from different
> tables into a different subplan making it a bit more complex to implement,
> so I'd just restrict unnest() to a single table at this point.
Agreed! I'm not sure why you would want to do this - maybe an unusual
schema design.

On Tue, 28 Sept 2021 at 06:07, Gabor Kaszab <ga...@cloudera.com>
wrote:

> Thanks for sharing your view, Tim! And sorry for the late answer.
>
> *Feature flag:*
> The more thought I give this the more I feel that using a feature flag
> indeed would be confusing. Additionally, I like the idea of having a
> special syntax for the zipping unnest.
>
> *Syntax:*
> I checked the SQL standard and it says something really similar to your
> suggestion:
> Example from the standard:
> SELECT M.E FROM UNNEST (MVE) AS M(E)
>
> <collection derived table> ::=
>   UNNEST <left paren> <collection value expression>
> [ { <comma> <collection value expression> }... ] <right paren> [ WITH
> ORDINALITY ]
>
> And you can specify aliases for the <collection derived table>s with the AS
> keyword and even the standard mentions multiple collection parameters for
> the unnest(). So let's have the query rewrite this way:
> SELECT UNNEST(t.a1), UNNEST(t.a2) FROM t;
> ==>
> SELECT _tmp1, _tmp2 FROM t, UNNEST(t.a1, t.a2) AS (_tmp1, tmp2);
>
> *Arrays from multiple tables:*
> I checked in Postgres and apparently you can unnest arrays from multiple
> tables and they still get zipped. I think for the first iteration this
> would be overkill to implement as Impala puts the arrays from different
> tables into a different subplan making it a bit more complex to implement,
> so I'd just restrict unnest() to a single table at this point.
>
> *UnnestTableRef:*
> I also thought that these unnests in the FROM clause should go into a kind
> of TableRef (e.g. UnnestTableRef?, ZippingUnnestTableRef) but there I found
> that each TableRef has a single rawPath_ and this might not be the best for
> a single TableRef descendant holding multiple arrays with multiple
> rawPath_s. But this is not a big issue, I'm sure there is another way to
> get around this.
>
> Cheers,
> Gabor
>
>
> On Tue, Sep 14, 2021 at 7:36 AM Tim Armstrong <ti...@gmail.com>
> wrote:
>
> > I think compatibility with Postgres derivatives generally makes sense as
> a
> > goal, even if the behaviour isn't ideal - definitely much easier than
> > rewriting everything to a different syntax.
> >
> > With your rewrite proposal, would you also be able to express these
> queries
> > by unnesting in the FROM clause?
> >
> > SELECT unnest(t.a) c1, unnest(t.b) c1
> > FROM tbl t
> > =>
> > SELECT _tmp1 c1, _tmp2 c2
> > FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)
> >
> > I'm not sure a feature flag would work well for this, since you
> > might potentially have a more complex FROM clause with other unnesting
> and
> > I'm not sure how you would model that. I suspect it works out cleaner to
> > model the zipped arrays as a TableRef - for most purposes in analysis and
> > planning it's going to be treated the same as a TableRef that is just a
> > plain array of structs - the analysis/planner code for TableRefs, joins,
> > etc isn't the easiest to extend.
> >
> > I guess you could also rewrite it to something like this too, which would
> > really let you isolate the special handling to a simple SELECT block:
> > SELECT v._tmp1 c1, v._tmp2 c2
> > FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v
> >
> > Another question: what happens if the two arrays are from different table
> > references? Is that allowed? Do they still get zipped together?
> >
> > On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <ga...@apache.org>
> > wrote:
> >
> > > Hey All,
> > > Thanks for taking a look and sharing your view. The whole idea of
> having
> > > multiple UNNEST() functions in the select list (and zipping the
> results)
> > > came from a user as a direct feature request as they try to move to
> > Impala
> > > from Greenplum and as they say they can't rewrite their existing
> queries.
> > >
> > > The most straightforward way would be to implement UNNEST() as a UDTF
> but
> > > Impala doesn't support them, and anyway, as mentioned above Impala
> > doesn't
> > > support even UDFs with nested type parameters.
> > >
> > > My idea was something like introducing an unnest SlotRef or such,
> change
> > > the SQL parser to translate each UNNEST(array) into this SlotRef and
> then
> > > do a query rewrite based on that so that we can process the unnest. My
> > > preference would be to rewrite the query to the syntax we have now with
> > > joining the arrays in the FROM clause as collection table refs and then
> > > have a feature flag (e.g. unnest_join_type) to indicate whether we want
> > the
> > > joined or the zipped behaviour. For the zipped one we can modify the
> > > existing UNNEST node to be able to receive more arrays and then do the
> > zip
> > > itself.
> > >
> > > What do you think?
> > >
> > > Gabor
> > >
> > > On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pe...@126.com> wrote:
> > >
> > > > Hi  All,
> > > >
> > > >
> > > > Currently IMPALA don't support complex data type as input/output
> > > > parameters for Bulltin Functions / User Defined Functions,
> > > > If IMPALA can support this, the Unnest()  or other  operation on
> > complex
> > > > data type may easy to achieve.
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best Regards
> > > > Dou
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At 2021-09-13 15:08:52, "Quanlong Huang" <hu...@gmail.com>
> > > wrote:
> > > > >This seems like the explode() UDTF in Hive:
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > > > >
> > > > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> > > make
> > > > >the same choice that only supports one UNNEST() in the SelectList.
> If
> > > > >multiple columns need to be unfolded, they should be wrapped in the
> > same
> > > > >UNNEST() function with a keyword/func specifying how to join the
> rows.
> > > > E.g.
> > > > >
> > > > >-- cross join by default
> > > > >select UNNEST(array_a, array_b) from tbl;
> > > > >-- specifying the join type
> > > > >select UNNEST(zip array_a, array_b) from tbl;
> > > > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > > > >
> > > > >explode() in Hive only supports one column. The above syntax is
> > inspired
> > > > by
> > > > >SparkSQL:
> > > > >
> > > >
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > > > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > > > >
> > > > >On a separate note, what about using the same name(explode) as
> Hive? I
> > > > >think it'd be helpful for migrations into Impala, especially if we
> > also
> > > > >support the LATERAL VIEW syntax.
> > > > >
> > > > >Thanks,
> > > > >Quanlong
> > > > >
> > > > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> > > tim.g.armstrong@gmail.com>
> > > > >wrote:
> > > > >
> > > > >> The syntax/behaviour seems kinda unusual to me - do any other
> > systems
> > > > aside
> > > > >> from postgres implement this? UNNEST() in the from clause seems
> more
> > > > >> common, and it is generally not consistent with the rest of SQL
> for
> > > > SELECT
> > > > >> clause entries to affect the set of rows returned.
> > > > >>
> > > > >> I'm not sure but it seems like it might be tricky to make the
> select
> > > > clause
> > > > >> in the Impala frontend do this.
> > > > >>
> > > > >> Some way of doing this in FROM clause seems more intuitive and
> > > > consistent
> > > > >> with the rest of sql, e.g.
> > > > >>
> > > > >> select id, e1, e2
> > > > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > > > >>
> > > > >> I'm tempted to say that it should be called ZIP() since that's the
> > > name
> > > > of
> > > > >> this operation in functional programming, but I don't think
> > inventing
> > > > new
> > > > >> SQL keywords is really ideal.
> > > > >>
> > > > >> I'm also a little curious about the use case for this. I think
> > ideally
> > > > you
> > > > >> would represent two arrays with correlated indices as an array of
> > > > structs
> > > > >> instead of two arrays, but I can appreciate that schema designs
> may
> > > not
> > > > >> always be fully normalised.
> > > > >>
> > > > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> > Hi Gabor,
> > > > >> > An implicit assumption in the proposed syntax is that there is
> > some
> > > > >> > association between an item at a particular position in the
> first
> > > > array
> > > > >> and
> > > > >> > the same position in the second array.
> > > > >> > However,  in general the arrays are not ordered in a semantic
> > sense.
> > > > e.g
> > > > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
> > >  The
> > > > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > > > Accord}
> > > > >> > appear together which would be unexpected.  Although one could
> > make
> > > > the
> > > > >> > argument that in order to get this behavior you could fall back
> to
> > > the
> > > > >> > current syntax of putting the arrays in the FROM clause, it
> could
> > be
> > > > >> > confusing to explain when one should be used vs other.  It would
> > be
> > > > >> useful
> > > > >> > to see what other systems (apart from Postgres) which have more
> > > native
> > > > >> > complex types support do in this case (e.g Presto, Couchbase,
> > > > Bigquery).
> > > > >> >
> > > > >> > One alternative I can think of is to make the association
> between
> > > the
> > > > >> items
> > > > >> > in multiple arrays be more explicit e.g by using an outer join
> on
> > > the
> > > > >> > ordinal position. That way, items at the same position appear
> > > together
> > > > >> in a
> > > > >> > row and for non-matches we produce Null for the smaller array. I
> > > > haven't
> > > > >> > thought through the exact syntax for this yet.
> > > > >> >
> > > > >> > -Aman
> > > > >> >
> > > > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > > > gaborkaszab@cloudera.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hey,
> > > > >> > >
> > > > >> > > I'm working on implementing a new UNNEST() operation that does
> > > what
> > > > it
> > > > >> > > says, unnests arrays. But here the trick would come when there
> > are
> > > > >> > multiple
> > > > >> > > UNNESTS() in the select list. Let me explain the desired
> > behaviour
> > > > with
> > > > >> > an
> > > > >> > > example:
> > > > >> > >
> > > > >> > > Let's assume the following table:
> > > > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > > > >> > > An the following query:
> > > > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > > > >> > >
> > > > >> > > Here the expected result is to not produce a join on the two
> > > arrays
> > > > as
> > > > >> we
> > > > >> > > would with the original way of giving them in the select lit,
> > but
> > > to
> > > > >> put
> > > > >> > > the values of the arrays "next to each other" like this:
> > > > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > > > >> > > I have investigated some options how this could be implemented
> > and
> > > > >> > created
> > > > >> > > a doc for my notes. Let me share this with the community so
> that
> > > > people
> > > > >> > can
> > > > >> > > share their preference. Any feedback is welcome!
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > > Gabor
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: Re: UNNEST() operation for arrays

Posted by Gabor Kaszab <ga...@cloudera.com>.

Thanks for sharing your view, Tim! And sorry for the late answer.

*Feature flag:*
The more thought I give this the more I feel that using a feature flag
indeed would be confusing. Additionally, I like the idea of having a
special syntax for the zipping unnest.

*Syntax:*
I checked the SQL standard and it says something really similar to your
suggestion:
Example from the standard:
SELECT M.E FROM UNNEST (MVE) AS M(E)

<collection derived table> ::=
  UNNEST <left paren> <collection value expression>
[ { <comma> <collection value expression> }... ] <right paren> [ WITH
ORDINALITY ]

And you can specify aliases for the <collection derived table>s with the AS
keyword and even the standard mentions multiple collection parameters for
the unnest(). So let's have the query rewrite this way:
SELECT UNNEST(t.a1), UNNEST(t.a2) FROM t;
==>
SELECT _tmp1, _tmp2 FROM t, UNNEST(t.a1, t.a2) AS (_tmp1, tmp2);

*Arrays from multiple tables:*
I checked in Postgres and apparently you can unnest arrays from multiple
tables and they still get zipped. I think for the first iteration this
would be overkill to implement as Impala puts the arrays from different
tables into a different subplan making it a bit more complex to implement,
so I'd just restrict unnest() to a single table at this point.

*UnnestTableRef:*
I also thought that these unnests in the FROM clause should go into a kind
of TableRef (e.g. UnnestTableRef?, ZippingUnnestTableRef) but there I found
that each TableRef has a single rawPath_ and this might not be the best for
a single TableRef descendant holding multiple arrays with multiple
rawPath_s. But this is not a big issue, I'm sure there is another way to
get around this.

Cheers,
Gabor


On Tue, Sep 14, 2021 at 7:36 AM Tim Armstrong <ti...@gmail.com>
wrote:

> I think compatibility with Postgres derivatives generally makes sense as a
> goal, even if the behaviour isn't ideal - definitely much easier than
> rewriting everything to a different syntax.
>
> With your rewrite proposal, would you also be able to express these queries
> by unnesting in the FROM clause?
>
> SELECT unnest(t.a) c1, unnest(t.b) c1
> FROM tbl t
> =>
> SELECT _tmp1 c1, _tmp2 c2
> FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)
>
> I'm not sure a feature flag would work well for this, since you
> might potentially have a more complex FROM clause with other unnesting and
> I'm not sure how you would model that. I suspect it works out cleaner to
> model the zipped arrays as a TableRef - for most purposes in analysis and
> planning it's going to be treated the same as a TableRef that is just a
> plain array of structs - the analysis/planner code for TableRefs, joins,
> etc isn't the easiest to extend.
>
> I guess you could also rewrite it to something like this too, which would
> really let you isolate the special handling to a simple SELECT block:
> SELECT v._tmp1 c1, v._tmp2 c2
> FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v
>
> Another question: what happens if the two arrays are from different table
> references? Is that allowed? Do they still get zipped together?
>
> On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <ga...@apache.org>
> wrote:
>
> > Hey All,
> > Thanks for taking a look and sharing your view. The whole idea of having
> > multiple UNNEST() functions in the select list (and zipping the results)
> > came from a user as a direct feature request as they try to move to
> Impala
> > from Greenplum and as they say they can't rewrite their existing queries.
> >
> > The most straightforward way would be to implement UNNEST() as a UDTF but
> > Impala doesn't support them, and anyway, as mentioned above Impala
> doesn't
> > support even UDFs with nested type parameters.
> >
> > My idea was something like introducing an unnest SlotRef or such, change
> > the SQL parser to translate each UNNEST(array) into this SlotRef and then
> > do a query rewrite based on that so that we can process the unnest. My
> > preference would be to rewrite the query to the syntax we have now with
> > joining the arrays in the FROM clause as collection table refs and then
> > have a feature flag (e.g. unnest_join_type) to indicate whether we want
> the
> > joined or the zipped behaviour. For the zipped one we can modify the
> > existing UNNEST node to be able to receive more arrays and then do the
> zip
> > itself.
> >
> > What do you think?
> >
> > Gabor
> >
> > On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pe...@126.com> wrote:
> >
> > > Hi  All,
> > >
> > >
> > > Currently IMPALA don't support complex data type as input/output
> > > parameters for Bulltin Functions / User Defined Functions,
> > > If IMPALA can support this, the Unnest()  or other  operation on
> complex
> > > data type may easy to achieve.
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Best Regards
> > > Dou
> > >
> > >
> > >
> > >
> > >
> > > At 2021-09-13 15:08:52, "Quanlong Huang" <hu...@gmail.com>
> > wrote:
> > > >This seems like the explode() UDTF in Hive:
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > > >
> > > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> > make
> > > >the same choice that only supports one UNNEST() in the SelectList. If
> > > >multiple columns need to be unfolded, they should be wrapped in the
> same
> > > >UNNEST() function with a keyword/func specifying how to join the rows.
> > > E.g.
> > > >
> > > >-- cross join by default
> > > >select UNNEST(array_a, array_b) from tbl;
> > > >-- specifying the join type
> > > >select UNNEST(zip array_a, array_b) from tbl;
> > > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > > >
> > > >explode() in Hive only supports one column. The above syntax is
> inspired
> > > by
> > > >SparkSQL:
> > > >
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > > >
> > > >On a separate note, what about using the same name(explode) as Hive? I
> > > >think it'd be helpful for migrations into Impala, especially if we
> also
> > > >support the LATERAL VIEW syntax.
> > > >
> > > >Thanks,
> > > >Quanlong
> > > >
> > > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> > tim.g.armstrong@gmail.com>
> > > >wrote:
> > > >
> > > >> The syntax/behaviour seems kinda unusual to me - do any other
> systems
> > > aside
> > > >> from postgres implement this? UNNEST() in the from clause seems more
> > > >> common, and it is generally not consistent with the rest of SQL for
> > > SELECT
> > > >> clause entries to affect the set of rows returned.
> > > >>
> > > >> I'm not sure but it seems like it might be tricky to make the select
> > > clause
> > > >> in the Impala frontend do this.
> > > >>
> > > >> Some way of doing this in FROM clause seems more intuitive and
> > > consistent
> > > >> with the rest of sql, e.g.
> > > >>
> > > >> select id, e1, e2
> > > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > > >>
> > > >> I'm tempted to say that it should be called ZIP() since that's the
> > name
> > > of
> > > >> this operation in functional programming, but I don't think
> inventing
> > > new
> > > >> SQL keywords is really ideal.
> > > >>
> > > >> I'm also a little curious about the use case for this. I think
> ideally
> > > you
> > > >> would represent two arrays with correlated indices as an array of
> > > structs
> > > >> instead of two arrays, but I can appreciate that schema designs may
> > not
> > > >> always be fully normalised.
> > > >>
> > > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hi Gabor,
> > > >> > An implicit assumption in the proposed syntax is that there is
> some
> > > >> > association between an item at a particular position in the first
> > > array
> > > >> and
> > > >> > the same position in the second array.
> > > >> > However,  in general the arrays are not ordered in a semantic
> sense.
> > > e.g
> > > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
> >  The
> > > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > > Accord}
> > > >> > appear together which would be unexpected.  Although one could
> make
> > > the
> > > >> > argument that in order to get this behavior you could fall back to
> > the
> > > >> > current syntax of putting the arrays in the FROM clause, it could
> be
> > > >> > confusing to explain when one should be used vs other.  It would
> be
> > > >> useful
> > > >> > to see what other systems (apart from Postgres) which have more
> > native
> > > >> > complex types support do in this case (e.g Presto, Couchbase,
> > > Bigquery).
> > > >> >
> > > >> > One alternative I can think of is to make the association between
> > the
> > > >> items
> > > >> > in multiple arrays be more explicit e.g by using an outer join on
> > the
> > > >> > ordinal position. That way, items at the same position appear
> > together
> > > >> in a
> > > >> > row and for non-matches we produce Null for the smaller array. I
> > > haven't
> > > >> > thought through the exact syntax for this yet.
> > > >> >
> > > >> > -Aman
> > > >> >
> > > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > > gaborkaszab@cloudera.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hey,
> > > >> > >
> > > >> > > I'm working on implementing a new UNNEST() operation that does
> > what
> > > it
> > > >> > > says, unnests arrays. But here the trick would come when there
> are
> > > >> > multiple
> > > >> > > UNNESTS() in the select list. Let me explain the desired
> behaviour
> > > with
> > > >> > an
> > > >> > > example:
> > > >> > >
> > > >> > > Let's assume the following table:
> > > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > > >> > > An the following query:
> > > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > > >> > >
> > > >> > > Here the expected result is to not produce a join on the two
> > arrays
> > > as
> > > >> we
> > > >> > > would with the original way of giving them in the select lit,
> but
> > to
> > > >> put
> > > >> > > the values of the arrays "next to each other" like this:
> > > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > > >> > > I have investigated some options how this could be implemented
> and
> > > >> > created
> > > >> > > a doc for my notes. Let me share this with the community so that
> > > people
> > > >> > can
> > > >> > > share their preference. Any feedback is welcome!
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > > >> > >
> > > >> > > Cheers,
> > > >> > > Gabor
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>

Re: Re: UNNEST() operation for arrays

Posted by Tim Armstrong <ti...@gmail.com>.

I think compatibility with Postgres derivatives generally makes sense as a
goal, even if the behaviour isn't ideal - definitely much easier than
rewriting everything to a different syntax.

With your rewrite proposal, would you also be able to express these queries
by unnesting in the FROM clause?

SELECT unnest(t.a) c1, unnest(t.b) c1
FROM tbl t
=>
SELECT _tmp1 c1, _tmp2 c2
FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)

I'm not sure a feature flag would work well for this, since you
might potentially have a more complex FROM clause with other unnesting and
I'm not sure how you would model that. I suspect it works out cleaner to
model the zipped arrays as a TableRef - for most purposes in analysis and
planning it's going to be treated the same as a TableRef that is just a
plain array of structs - the analysis/planner code for TableRefs, joins,
etc isn't the easiest to extend.

I guess you could also rewrite it to something like this too, which would
really let you isolate the special handling to a simple SELECT block:
SELECT v._tmp1 c1, v._tmp2 c2
FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v

Another question: what happens if the two arrays are from different table
references? Is that allowed? Do they still get zipped together?

On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <ga...@apache.org> wrote:

> Hey All,
> Thanks for taking a look and sharing your view. The whole idea of having
> multiple UNNEST() functions in the select list (and zipping the results)
> came from a user as a direct feature request as they try to move to Impala
> from Greenplum and as they say they can't rewrite their existing queries.
>
> The most straightforward way would be to implement UNNEST() as a UDTF but
> Impala doesn't support them, and anyway, as mentioned above Impala doesn't
> support even UDFs with nested type parameters.
>
> My idea was something like introducing an unnest SlotRef or such, change
> the SQL parser to translate each UNNEST(array) into this SlotRef and then
> do a query rewrite based on that so that we can process the unnest. My
> preference would be to rewrite the query to the syntax we have now with
> joining the arrays in the FROM clause as collection table refs and then
> have a feature flag (e.g. unnest_join_type) to indicate whether we want the
> joined or the zipped behaviour. For the zipped one we can modify the
> existing UNNEST node to be able to receive more arrays and then do the zip
> itself.
>
> What do you think?
>
> Gabor
>
> On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pe...@126.com> wrote:
>
> > Hi  All,
> >
> >
> > Currently IMPALA don't support complex data type as input/output
> > parameters for Bulltin Functions / User Defined Functions,
> > If IMPALA can support this, the Unnest()  or other  operation on  complex
> > data type may easy to achieve.
> >
> >
> >
> >
> > --
> >
> > Best Regards
> > Dou
> >
> >
> >
> >
> >
> > At 2021-09-13 15:08:52, "Quanlong Huang" <hu...@gmail.com>
> wrote:
> > >This seems like the explode() UDTF in Hive:
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > >
> > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> make
> > >the same choice that only supports one UNNEST() in the SelectList. If
> > >multiple columns need to be unfolded, they should be wrapped in the same
> > >UNNEST() function with a keyword/func specifying how to join the rows.
> > E.g.
> > >
> > >-- cross join by default
> > >select UNNEST(array_a, array_b) from tbl;
> > >-- specifying the join type
> > >select UNNEST(zip array_a, array_b) from tbl;
> > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > >
> > >explode() in Hive only supports one column. The above syntax is inspired
> > by
> > >SparkSQL:
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > >
> > >On a separate note, what about using the same name(explode) as Hive? I
> > >think it'd be helpful for migrations into Impala, especially if we also
> > >support the LATERAL VIEW syntax.
> > >
> > >Thanks,
> > >Quanlong
> > >
> > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> tim.g.armstrong@gmail.com>
> > >wrote:
> > >
> > >> The syntax/behaviour seems kinda unusual to me - do any other systems
> > aside
> > >> from postgres implement this? UNNEST() in the from clause seems more
> > >> common, and it is generally not consistent with the rest of SQL for
> > SELECT
> > >> clause entries to affect the set of rows returned.
> > >>
> > >> I'm not sure but it seems like it might be tricky to make the select
> > clause
> > >> in the Impala frontend do this.
> > >>
> > >> Some way of doing this in FROM clause seems more intuitive and
> > consistent
> > >> with the rest of sql, e.g.
> > >>
> > >> select id, e1, e2
> > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > >>
> > >> I'm tempted to say that it should be called ZIP() since that's the
> name
> > of
> > >> this operation in functional programming, but I don't think inventing
> > new
> > >> SQL keywords is really ideal.
> > >>
> > >> I'm also a little curious about the use case for this. I think ideally
> > you
> > >> would represent two arrays with correlated indices as an array of
> > structs
> > >> instead of two arrays, but I can appreciate that schema designs may
> not
> > >> always be fully normalised.
> > >>
> > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com>
> wrote:
> > >>
> > >> > Hi Gabor,
> > >> > An implicit assumption in the proposed syntax is that there is some
> > >> > association between an item at a particular position in the first
> > array
> > >> and
> > >> > the same position in the second array.
> > >> > However,  in general the arrays are not ordered in a semantic sense.
> > e.g
> > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
>  The
> > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > Accord}
> > >> > appear together which would be unexpected.  Although one could make
> > the
> > >> > argument that in order to get this behavior you could fall back to
> the
> > >> > current syntax of putting the arrays in the FROM clause, it could be
> > >> > confusing to explain when one should be used vs other.  It would be
> > >> useful
> > >> > to see what other systems (apart from Postgres) which have more
> native
> > >> > complex types support do in this case (e.g Presto, Couchbase,
> > Bigquery).
> > >> >
> > >> > One alternative I can think of is to make the association between
> the
> > >> items
> > >> > in multiple arrays be more explicit e.g by using an outer join on
> the
> > >> > ordinal position. That way, items at the same position appear
> together
> > >> in a
> > >> > row and for non-matches we produce Null for the smaller array. I
> > haven't
> > >> > thought through the exact syntax for this yet.
> > >> >
> > >> > -Aman
> > >> >
> > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > gaborkaszab@cloudera.com>
> > >> > wrote:
> > >> >
> > >> > > Hey,
> > >> > >
> > >> > > I'm working on implementing a new UNNEST() operation that does
> what
> > it
> > >> > > says, unnests arrays. But here the trick would come when there are
> > >> > multiple
> > >> > > UNNESTS() in the select list. Let me explain the desired behaviour
> > with
> > >> > an
> > >> > > example:
> > >> > >
> > >> > > Let's assume the following table:
> > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > >> > > An the following query:
> > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > >> > >
> > >> > > Here the expected result is to not produce a join on the two
> arrays
> > as
> > >> we
> > >> > > would with the original way of giving them in the select lit, but
> to
> > >> put
> > >> > > the values of the arrays "next to each other" like this:
> > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > >> > > I have investigated some options how this could be implemented and
> > >> > created
> > >> > > a doc for my notes. Let me share this with the community so that
> > people
> > >> > can
> > >> > > share their preference. Any feedback is welcome!
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > >> > >
> > >> > > Cheers,
> > >> > > Gabor
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> >
>

Re: Re: UNNEST() operation for arrays

Posted by Gabor Kaszab <ga...@apache.org>.

Hey All,
Thanks for taking a look and sharing your view. The whole idea of having
multiple UNNEST() functions in the select list (and zipping the results)
came from a user as a direct feature request as they try to move to Impala
from Greenplum and as they say they can't rewrite their existing queries.

The most straightforward way would be to implement UNNEST() as a UDTF but
Impala doesn't support them, and anyway, as mentioned above Impala doesn't
support even UDFs with nested type parameters.

My idea was something like introducing an unnest SlotRef or such, change
the SQL parser to translate each UNNEST(array) into this SlotRef and then
do a query rewrite based on that so that we can process the unnest. My
preference would be to rewrite the query to the syntax we have now with
joining the arrays in the FROM clause as collection table refs and then
have a feature flag (e.g. unnest_join_type) to indicate whether we want the
joined or the zipped behaviour. For the zipped one we can modify the
existing UNNEST node to be able to receive more arrays and then do the zip
itself.

What do you think?

Gabor

On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pe...@126.com> wrote:

> Hi  All,
>
>
> Currently IMPALA don't support complex data type as input/output
> parameters for Bulltin Functions / User Defined Functions,
> If IMPALA can support this, the Unnest()  or other  operation on  complex
> data type may easy to achieve.
>
>
>
>
> --
>
> Best Regards
> Dou
>
>
>
>
>
> At 2021-09-13 15:08:52, "Quanlong Huang" <hu...@gmail.com> wrote:
> >This seems like the explode() UDTF in Hive:
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> >
> >Hive doesn't support multiple UDTF in the SelectList. I think we can make
> >the same choice that only supports one UNNEST() in the SelectList. If
> >multiple columns need to be unfolded, they should be wrapped in the same
> >UNNEST() function with a keyword/func specifying how to join the rows.
> E.g.
> >
> >-- cross join by default
> >select UNNEST(array_a, array_b) from tbl;
> >-- specifying the join type
> >select UNNEST(zip array_a, array_b) from tbl;
> >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> >
> >explode() in Hive only supports one column. The above syntax is inspired
> by
> >SparkSQL:
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> >
> >On a separate note, what about using the same name(explode) as Hive? I
> >think it'd be helpful for migrations into Impala, especially if we also
> >support the LATERAL VIEW syntax.
> >
> >Thanks,
> >Quanlong
> >
> >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <ti...@gmail.com>
> >wrote:
> >
> >> The syntax/behaviour seems kinda unusual to me - do any other systems
> aside
> >> from postgres implement this? UNNEST() in the from clause seems more
> >> common, and it is generally not consistent with the rest of SQL for
> SELECT
> >> clause entries to affect the set of rows returned.
> >>
> >> I'm not sure but it seems like it might be tricky to make the select
> clause
> >> in the Impala frontend do this.
> >>
> >> Some way of doing this in FROM clause seems more intuitive and
> consistent
> >> with the rest of sql, e.g.
> >>
> >> select id, e1, e2
> >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> >>
> >> I'm tempted to say that it should be called ZIP() since that's the name
> of
> >> this operation in functional programming, but I don't think inventing
> new
> >> SQL keywords is really ideal.
> >>
> >> I'm also a little curious about the use case for this. I think ideally
> you
> >> would represent two arrays with correlated indices as an array of
> structs
> >> instead of two arrays, but I can appreciate that schema designs may not
> >> always be fully normalised.
> >>
> >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com> wrote:
> >>
> >> > Hi Gabor,
> >> > An implicit assumption in the proposed syntax is that there is some
> >> > association between an item at a particular position in the first
> array
> >> and
> >> > the same position in the second array.
> >> > However,  in general the arrays are not ordered in a semantic sense.
> e.g
> >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
> >> > proposed UNNEST behavior will not produce any rows where {Honda,
> Accord}
> >> > appear together which would be unexpected.  Although one could make
> the
> >> > argument that in order to get this behavior you could fall back to the
> >> > current syntax of putting the arrays in the FROM clause, it could be
> >> > confusing to explain when one should be used vs other.  It would be
> >> useful
> >> > to see what other systems (apart from Postgres) which have more native
> >> > complex types support do in this case (e.g Presto, Couchbase,
> Bigquery).
> >> >
> >> > One alternative I can think of is to make the association between the
> >> items
> >> > in multiple arrays be more explicit e.g by using an outer join on the
> >> > ordinal position. That way, items at the same position appear together
> >> in a
> >> > row and for non-matches we produce Null for the smaller array. I
> haven't
> >> > thought through the exact syntax for this yet.
> >> >
> >> > -Aman
> >> >
> >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> gaborkaszab@cloudera.com>
> >> > wrote:
> >> >
> >> > > Hey,
> >> > >
> >> > > I'm working on implementing a new UNNEST() operation that does what
> it
> >> > > says, unnests arrays. But here the trick would come when there are
> >> > multiple
> >> > > UNNESTS() in the select list. Let me explain the desired behaviour
> with
> >> > an
> >> > > example:
> >> > >
> >> > > Let's assume the following table:
> >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> >> > > An the following query:
> >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> >> > >
> >> > > Here the expected result is to not produce a join on the two arrays
> as
> >> we
> >> > > would with the original way of giving them in the select lit, but to
> >> put
> >> > > the values of the arrays "next to each other" like this:
> >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> >> > > I have investigated some options how this could be implemented and
> >> > created
> >> > > a doc for my notes. Let me share this with the community so that
> people
> >> > can
> >> > > share their preference. Any feedback is welcome!
> >> > >
> >> > >
> >> > >
> >> >
> >>
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> >> > >
> >> > > Cheers,
> >> > > Gabor
> >> > >
> >> > >
> >> > >
> >> >
> >>
>

Re:Re: UNNEST() operation for arrays

Posted by 彭斗 <pe...@126.com>.

Hi  All,


Currently IMPALA don't support complex data type as input/output parameters for Bulltin Functions / User Defined Functions,
If IMPALA can support this, the Unnest()  or other  operation on  complex data type may easy to achieve.




--

Best Regards
Dou





At 2021-09-13 15:08:52, "Quanlong Huang" <hu...@gmail.com> wrote:
>This seems like the explode() UDTF in Hive:
>https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
>
>Hive doesn't support multiple UDTF in the SelectList. I think we can make
>the same choice that only supports one UNNEST() in the SelectList. If
>multiple columns need to be unfolded, they should be wrapped in the same
>UNNEST() function with a keyword/func specifying how to join the rows. E.g.
>
>-- cross join by default
>select UNNEST(array_a, array_b) from tbl;
>-- specifying the join type
>select UNNEST(zip array_a, array_b) from tbl;
>select UNNEST(arrays_zip(array_a, array_b)) from tbl;
>
>explode() in Hive only supports one column. The above syntax is inspired by
>SparkSQL:
>https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
>https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
>
>On a separate note, what about using the same name(explode) as Hive? I
>think it'd be helpful for migrations into Impala, especially if we also
>support the LATERAL VIEW syntax.
>
>Thanks,
>Quanlong
>
>On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <ti...@gmail.com>
>wrote:
>
>> The syntax/behaviour seems kinda unusual to me - do any other systems aside
>> from postgres implement this? UNNEST() in the from clause seems more
>> common, and it is generally not consistent with the rest of SQL for SELECT
>> clause entries to affect the set of rows returned.
>>
>> I'm not sure but it seems like it might be tricky to make the select clause
>> in the Impala frontend do this.
>>
>> Some way of doing this in FROM clause seems more intuitive and consistent
>> with the rest of sql, e.g.
>>
>> select id, e1, e2
>> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
>>
>> I'm tempted to say that it should be called ZIP() since that's the name of
>> this operation in functional programming, but I don't think inventing new
>> SQL keywords is really ideal.
>>
>> I'm also a little curious about the use case for this. I think ideally you
>> would represent two arrays with correlated indices as an array of structs
>> instead of two arrays, but I can appreciate that schema designs may not
>> always be fully normalised.
>>
>> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com> wrote:
>>
>> > Hi Gabor,
>> > An implicit assumption in the proposed syntax is that there is some
>> > association between an item at a particular position in the first array
>> and
>> > the same position in the second array.
>> > However,  in general the arrays are not ordered in a semantic sense.  e.g
>> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
>> > proposed UNNEST behavior will not produce any rows where {Honda, Accord}
>> > appear together which would be unexpected.  Although one could make the
>> > argument that in order to get this behavior you could fall back to the
>> > current syntax of putting the arrays in the FROM clause, it could be
>> > confusing to explain when one should be used vs other.  It would be
>> useful
>> > to see what other systems (apart from Postgres) which have more native
>> > complex types support do in this case (e.g Presto, Couchbase, Bigquery).
>> >
>> > One alternative I can think of is to make the association between the
>> items
>> > in multiple arrays be more explicit e.g by using an outer join on the
>> > ordinal position. That way, items at the same position appear together
>> in a
>> > row and for non-matches we produce Null for the smaller array. I haven't
>> > thought through the exact syntax for this yet.
>> >
>> > -Aman
>> >
>> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <ga...@cloudera.com>
>> > wrote:
>> >
>> > > Hey,
>> > >
>> > > I'm working on implementing a new UNNEST() operation that does what it
>> > > says, unnests arrays. But here the trick would come when there are
>> > multiple
>> > > UNNESTS() in the select list. Let me explain the desired behaviour with
>> > an
>> > > example:
>> > >
>> > > Let's assume the following table:
>> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
>> > > An the following query:
>> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
>> > >
>> > > Here the expected result is to not produce a join on the two arrays as
>> we
>> > > would with the original way of giving them in the select lit, but to
>> put
>> > > the values of the arrays "next to each other" like this:
>> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
>> > > I have investigated some options how this could be implemented and
>> > created
>> > > a doc for my notes. Let me share this with the community so that people
>> > can
>> > > share their preference. Any feedback is welcome!
>> > >
>> > >
>> > >
>> >
>> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > >
>> > >
>> >
>>

Re: UNNEST() operation for arrays

Posted by Quanlong Huang <hu...@gmail.com>.

This seems like the explode() UDTF in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)

Hive doesn't support multiple UDTF in the SelectList. I think we can make
the same choice that only supports one UNNEST() in the SelectList. If
multiple columns need to be unfolded, they should be wrapped in the same
UNNEST() function with a keyword/func specifying how to join the rows. E.g.

-- cross join by default
select UNNEST(array_a, array_b) from tbl;
-- specifying the join type
select UNNEST(zip array_a, array_b) from tbl;
select UNNEST(arrays_zip(array_a, array_b)) from tbl;

explode() in Hive only supports one column. The above syntax is inspired by
SparkSQL:
https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip

On a separate note, what about using the same name(explode) as Hive? I
think it'd be helpful for migrations into Impala, especially if we also
support the LATERAL VIEW syntax.

Thanks,
Quanlong

On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <ti...@gmail.com>
wrote:

> The syntax/behaviour seems kinda unusual to me - do any other systems aside
> from postgres implement this? UNNEST() in the from clause seems more
> common, and it is generally not consistent with the rest of SQL for SELECT
> clause entries to affect the set of rows returned.
>
> I'm not sure but it seems like it might be tricky to make the select clause
> in the Impala frontend do this.
>
> Some way of doing this in FROM clause seems more intuitive and consistent
> with the rest of sql, e.g.
>
> select id, e1, e2
> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
>
> I'm tempted to say that it should be called ZIP() since that's the name of
> this operation in functional programming, but I don't think inventing new
> SQL keywords is really ideal.
>
> I'm also a little curious about the use case for this. I think ideally you
> would represent two arrays with correlated indices as an array of structs
> instead of two arrays, but I can appreciate that schema designs may not
> always be fully normalised.
>
> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com> wrote:
>
> > Hi Gabor,
> > An implicit assumption in the proposed syntax is that there is some
> > association between an item at a particular position in the first array
> and
> > the same position in the second array.
> > However,  in general the arrays are not ordered in a semantic sense.  e.g
> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
> > proposed UNNEST behavior will not produce any rows where {Honda, Accord}
> > appear together which would be unexpected.  Although one could make the
> > argument that in order to get this behavior you could fall back to the
> > current syntax of putting the arrays in the FROM clause, it could be
> > confusing to explain when one should be used vs other.  It would be
> useful
> > to see what other systems (apart from Postgres) which have more native
> > complex types support do in this case (e.g Presto, Couchbase, Bigquery).
> >
> > One alternative I can think of is to make the association between the
> items
> > in multiple arrays be more explicit e.g by using an outer join on the
> > ordinal position. That way, items at the same position appear together
> in a
> > row and for non-matches we produce Null for the smaller array. I haven't
> > thought through the exact syntax for this yet.
> >
> > -Aman
> >
> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <ga...@cloudera.com>
> > wrote:
> >
> > > Hey,
> > >
> > > I'm working on implementing a new UNNEST() operation that does what it
> > > says, unnests arrays. But here the trick would come when there are
> > multiple
> > > UNNESTS() in the select list. Let me explain the desired behaviour with
> > an
> > > example:
> > >
> > > Let's assume the following table:
> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > > An the following query:
> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > >
> > > Here the expected result is to not produce a join on the two arrays as
> we
> > > would with the original way of giving them in the select lit, but to
> put
> > > the values of the arrays "next to each other" like this:
> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > > I have investigated some options how this could be implemented and
> > created
> > > a doc for my notes. Let me share this with the community so that people
> > can
> > > share their preference. Any feedback is welcome!
> > >
> > >
> > >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > >
> >
>

Re: UNNEST() operation for arrays

Posted by Tim Armstrong <ti...@gmail.com>.

The syntax/behaviour seems kinda unusual to me - do any other systems aside
from postgres implement this? UNNEST() in the from clause seems more
common, and it is generally not consistent with the rest of SQL for SELECT
clause entries to affect the set of rows returned.

I'm not sure but it seems like it might be tricky to make the select clause
in the Impala frontend do this.

Some way of doing this in FROM clause seems more intuitive and consistent
with the rest of sql, e.g.

select id, e1, e2
from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)

I'm tempted to say that it should be called ZIP() since that's the name of
this operation in functional programming, but I don't think inventing new
SQL keywords is really ideal.

I'm also a little curious about the use case for this. I think ideally you
would represent two arrays with correlated indices as an array of structs
instead of two arrays, but I can appreciate that schema designs may not
always be fully normalised.

On Fri, 10 Sept 2021 at 18:38, Aman Sinha <am...@gmail.com> wrote:

> Hi Gabor,
> An implicit assumption in the proposed syntax is that there is some
> association between an item at a particular position in the first array and
> the same position in the second array.
> However,  in general the arrays are not ordered in a semantic sense.  e.g
> car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
> proposed UNNEST behavior will not produce any rows where {Honda, Accord}
> appear together which would be unexpected.  Although one could make the
> argument that in order to get this behavior you could fall back to the
> current syntax of putting the arrays in the FROM clause, it could be
> confusing to explain when one should be used vs other.  It would be useful
> to see what other systems (apart from Postgres) which have more native
> complex types support do in this case (e.g Presto, Couchbase, Bigquery).
>
> One alternative I can think of is to make the association between the items
> in multiple arrays be more explicit e.g by using an outer join on the
> ordinal position. That way, items at the same position appear together in a
> row and for non-matches we produce Null for the smaller array. I haven't
> thought through the exact syntax for this yet.
>
> -Aman
>
> On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <ga...@cloudera.com>
> wrote:
>
> > Hey,
> >
> > I'm working on implementing a new UNNEST() operation that does what it
> > says, unnests arrays. But here the trick would come when there are
> multiple
> > UNNESTS() in the select list. Let me explain the desired behaviour with
> an
> > example:
> >
> > Let's assume the following table:
> > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > An the following query:
> > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> >
> > Here the expected result is to not produce a join on the two arrays as we
> > would with the original way of giving them in the select lit, but to put
> > the values of the arrays "next to each other" like this:
> > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > I have investigated some options how this could be implemented and
> created
> > a doc for my notes. Let me share this with the community so that people
> can
> > share their preference. Any feedback is welcome!
> >
> >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> >
> > Cheers,
> > Gabor
> >
> >
> >
>

Re: UNNEST() operation for arrays

Posted by Aman Sinha <am...@gmail.com>.

Hi Gabor,
An implicit assumption in the proposed syntax is that there is some
association between an item at a particular position in the first array and
the same position in the second array.
However,  in general the arrays are not ordered in a semantic sense.  e.g
car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
proposed UNNEST behavior will not produce any rows where {Honda, Accord}
appear together which would be unexpected.  Although one could make the
argument that in order to get this behavior you could fall back to the
current syntax of putting the arrays in the FROM clause, it could be
confusing to explain when one should be used vs other.  It would be useful
to see what other systems (apart from Postgres) which have more native
complex types support do in this case (e.g Presto, Couchbase, Bigquery).

One alternative I can think of is to make the association between the items
in multiple arrays be more explicit e.g by using an outer join on the
ordinal position. That way, items at the same position appear together in a
row and for non-matches we produce Null for the smaller array. I haven't
thought through the exact syntax for this yet.

-Aman

On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <ga...@cloudera.com>
wrote:

> Hey,
>
> I'm working on implementing a new UNNEST() operation that does what it
> says, unnests arrays. But here the trick would come when there are multiple
> UNNESTS() in the select list. Let me explain the desired behaviour with an
> example:
>
> Let's assume the following table:
> [image: Screenshot 2021-09-10 at 19.58.53.png]
> An the following query:
> SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
>
> Here the expected result is to not produce a join on the two arrays as we
> would with the original way of giving them in the select lit, but to put
> the values of the arrays "next to each other" like this:
> [image: Screenshot 2021-09-10 at 19.59.04.png]
> I have investigated some options how this could be implemented and created
> a doc for my notes. Let me share this with the community so that people can
> share their preference. Any feedback is welcome!
>
>
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
>
> Cheers,
> Gabor
>
>
>