You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@asterixdb.apache.org by Müller Ingo <in...@inf.ethz.ch> on 2021/08/03 08:39:38 UTC

Producing combinations of array elements in SQL++

Dear SQL++ experts,

I am working on an academic study [1] comparing the expressiveness of various query languages for nested data in a high-energy physics use case. Among other systems, I have implemented the queries of the use case in SQL++ [2]. In general, SQL++ seems to be quite well suited for those queries, but one frequent pattern seems to have only cumbersome formulations.

In short, I am missing the functionality to produce (deterministic) element identifiers for array elements. In BigQuery’s SQL dialect [3], this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For now, I am using a construct similar to this: “FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element” (see full query here [4]). I have two issues with this: First, it needs undebatably more characters than the BigQuery version and it is arguably also more cumbersome. Second, I don’t think that it is even correct since the element numbers are not guaranteed to be deterministic (i.e., if I use that construct in several places in the query, the same array element may get different values for row_number()). The documentation on row_number says “If the window order clause is omitted, the return values may be unpredictable.“ However, introducing an order clause not only makes the pattern even more cumbersome (the elements are objects with at least four fields), it also does not guarantee a deterministic order still. The documentation of the Window Order Clause says: “The row_number() function returns a distinct number for each tuple. If tuples are tied, the results may be unpredictable.” (To be fair, if two array elements agree on all fields, I can probably treat them interchangeably in my use case.)

With that introduction, two questions: (1) Can you confirm that there is really no language construct better than what I currently use (such as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is really undeterministic in the way I use it?

I need these element identifiers for producing combinations (in the mathematical sense [5]) of the elements in an array field, but maybe there is a better way. In the simple case of producing all (unique) pairs of elements of a particular array, I use a construct like the following:

      FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element1,
           (SELECT array_field.*, row_number() OVER () AS idx FROM base_table. array_field) AS array_element2
      WHERE
        array_element1.idx < array_element2.idx

So a last question: (3) Can you think of a better way to compute combinations?

Thanks a lot in advance and best regards,
Ingo


[1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for High-Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021. https://arxiv.org/abs/2104.12615
[2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
[4] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp
[5] https://en.wikipedia.org/wiki/Combination

RE: Producing combinations of array elements in SQL++

Posted by Müller Ingo <in...@inf.ethz.ch>.

Hey Mike,

Thanks a lot for following up! I realize that one aspect wasn't very explicit in my description: I only want one permutation of each combination, e.g., I only want one pair out of (a1, a2) and (a2, a1). The Cartesian product contains both. In your example data, for "pkey": 2, I only want one of (3, 5) and (5, 3).

(This is really just a performance optimization: of all of the pairs, I eventually pick the one that maximizes some metric, and that metric is symmetric, so I can could pick any of the equivalent permutations. However, I doubt that any query optimizer can detect that, and for 3-combinations, that optimization makes a really significant difference.)

Off topic: ROOT [1], the standard framework used by physicists today, and interestingly PrestoDB as well [2], have a built-in function "combinations" for that purpose.

Cheers,
Ingo


[1] https://root.cern/doc/master/group__vecops.html#gabbbf82257156b750c7b9bdd39a53ee33
[2] https://prestodb.io/docs/current/functions/array.html#combinations


> -----Original Message-----
> From: Mike Carey <dt...@gmail.com>
> Sent: Thursday, August 5, 2021 1:03 AM
> To: users@asterixdb.apache.org
> Subject: Re: Producing combinations of array elements in SQL++
> 
> Ingo,
> 
> So if I'm understanding the use case right, and order is irrelevant, and it's just
> about computing unique pairs, here is a potentially simpler version:
> 
> 	SELECT pkey,
> 	       (SELECT DISTINCT a1, a2
> 	        FROM ds.array_field AS a1,
> 	             ds.array_field AS a2
> 	        ) AS upairs
> 	FROM my_dataset AS ds;
> 
> I.e., do a cartesian product of the array with itself and eliminate duplicate pairs.
> No positional info needed.  Does that meet the problem statement?  (I may be
> missing a part of it.)
> 
> 
> Cheers,
> 
> Mike
> 
> PS - Here is an example that can be run to test it w/o creating data (just
> pretending to do so with WITH):
> 
> 	WITH my_dataset AS
> 	   ([{"pkey": 1, "array_field":[1, 2, 2, 4]},
> 	    {"pkey": 2, "array_field":[3, 5]}])
> 	SELECT pkey,
> 	       (SELECT DISTINCT a1, a2
> 	        FROM ds.array_field AS a1,
> 	             ds.array_field AS a2
> 	        ) AS upairs
> 	FROM my_dataset AS ds;

Re: Producing combinations of array elements in SQL++

Posted by Mike Carey <dt...@gmail.com>.

Ingo,

So if I'm understanding the use case right, and order is irrelevant, and 
it's just about computing unique pairs, here is a potentially simpler 
version:

    SELECT pkey,
            (SELECT DISTINCT a1, a2
             FROM ds.array_field AS a1,
                  ds.array_field AS a2
             ) AS upairs
    FROM my_dataset AS ds;

I.e., do a cartesian product of the array with itself and eliminate 
duplicate pairs.  No positional info needed.  Does that meet the problem 
statement?  (I may be missing a part of it.)

Cheers,

Mike

PS - Here is an example that can be run to test it w/o creating data 
(just pretending to do so with WITH):

    WITH my_dataset AS
        ([{"pkey": 1, "array_field":[1, 2, 2, 4]},
         {"pkey": 2, "array_field":[3, 5]}])
    SELECT pkey,
            (SELECT DISTINCT a1, a2
             FROM ds.array_field AS a1,
                  ds.array_field AS a2
             ) AS upairs
    FROM my_dataset AS ds;

On 8/4/21 12:23 AM, Müller Ingo wrote:
> Hi Dmitry,
>
> Thanks a lot for the pointers! (In the snippets there, I also learned about the LET clause. Great feature -- I'll need to review the query implementations again to use it!)
>
> Cheers,
> Ingo
>
>
>> -----Original Message-----
>> From: Dmitry Lychagin <dm...@couchbase.com>
>> Sent: Wednesday, August 4, 2021 3:22 AM
>> To: users@asterixdb.apache.org
>> Subject: Re: Producing combinations of array elements in SQL++
>>
>> Hi Ingo,
>>
>> I think positional vars should probably work fine in your usecase (unnesting array
>> fields within an object).
>> There are issues with this feature when it's used in other contexts though (e.g.
>> position of an object inside a dataset) These are open bugs in JIRA that track
>> problems with this feature: [1][2][3]
>>
>> [1] https://issues.apache.org/jira/browse/ASTERIXDB-1050
>> [2] https://issues.apache.org/jira/browse/ASTERIXDB-1052
>> [3] https://issues.apache.org/jira/browse/ASTERIXDB-2595  (this one is relevant
>> for array unnesting too)
>>
>> Thanks,
>> -- Dmitry
>>
>>
>> On 8/3/21, 10:20 AM, "Müller  Ingo" <in...@inf.ethz.ch> wrote:
>>
>>       EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links
>>
>>
>>
>>
>>
>>      Quick update: The "AT posVar" construct seems to produce the correct result
>> in some simple test cases and are much cleaner that my previous construct (see
>> [6])! In particular, the fact that I don't even need UNNEST makes some queries
>> really concise :)
>>
>>      Small follow-up question: Is there an issue tracking the status of this feature? I
>> couldn't find it in JIRA...
>>
>>      Cheers,
>>      Ingo
>>
>>
>>      [6] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/commit/1023d97
>>
>>
>>      > -----Original Message-----
>>      > From: Müller Ingo <in...@inf.ethz.ch>
>>      > Sent: Tuesday, August 3, 2021 6:03 PM
>>      > To: users@asterixdb.apache.org
>>      > Subject: RE: Producing combinations of array elements in SQL++
>>      >
>>      > Dear Mike, dear Dmitry,
>>      >
>>      > Thanks a lot for the quick and helpful answers! I think what I was looking for
>> is
>>      > the "AT posVar". We are using similar constructs in several other SQL
>>      > dialects/query languages. I will try it out right away, but I am not sure yet
>> how
>>      > we will deal with such a "beta" feature in our study.
>>      >
>>      > @Mike: I indeed "just" need all combinations and the position of array
>> elements
>>      > is just a means to get that -- their order isn't important. (In fact, queries
>> typically
>>      > compute some metric for each combination and then the combination that
>>      > minimizes or maximizes that metric.) If you know a better way to express
>> that, I
>>      > am very interested!
>>      >
>>      > All the best,
>>      > Ingo
>>      >
>>      >
>>      > > -----Original Message-----
>>      > > From: Mike Carey <dt...@gmail.com>
>>      > > Sent: Tuesday, August 3, 2021 5:48 PM
>>      > > To: users@asterixdb.apache.org
>>      > > Subject: Re: Producing combinations of array elements in SQL++
>>      > >
>>      > > Ingo,
>>      > >
>>      > > ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array
>>      > > has [1, 2, 2, 3] as its contents.  What would the desired result be?
>>      > > The best approach to this will depend on what you mean by that.
>>      > >
>>      > >
>>      > > Cheers,
>>      > >
>>      > > Mike
>>      > >
>>      > >
>>      > > On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
>>      > >
>>      > >
>>      > >     Hi Ingo,
>>      > >
>>      > >
>>      > >
>>      > >     1) SQL++ supports positional variables in the FROM clause, as follows:
>>      > >
>>      > >     FROM dataset AS ds, ds.array_field AS elementVar AT posVar
>>      > >
>>      > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
>>      > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   --
>>      > FromTerm)
>>      > >
>>      > >     or using UNNEST instead of “,”
>>      > >
>>      > >     FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
>>      > >
>>      > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
>>      > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
>>      > > UnnestClause)
>>      > >
>>      > >
>>      > >
>>      > >     Positional variable is bound to a position of an element inside an
>>      > > array which is being unnested.
>>      > >
>>      > >
>>      > >
>>      > >     Here’s how your query could look like using these variables:
>>      > >
>>      > >     SELECT …
>>      > >
>>      > >     FROM base_table, base_table.array_field AS  array_element1 AT pos1,
>>      > > base_table.array_field AS  array_element2 AT pos2
>>      > >
>>      > >     WHERE pos1 < pos2
>>      > >
>>      > >
>>      > >
>>      > >     2) As for window function calls. I can confirm that if there is no
>>      > > ORDER BY sub-clause inside OVER clause then the order of the elements
>>      > > processed by window functions (row_number(), rank(), etc) is not
>> guaranteed.
>>      > >
>>      > >
>>      > >
>>      > >     Thanks,
>>      > >
>>      > >     -- Dmitry
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >     From: Müller Ingo <in...@inf.ethz.ch>
>>      > > <ma...@inf.ethz.ch>
>>      > >     Reply-To: "users@asterixdb.apache.org"
>>      > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
>>      > > <ma...@asterixdb.apache.org>
>>      > >     Date: Tuesday, August 3, 2021 at 1:39 AM
>>      > >     To: "users@asterixdb.apache.org"
>>      > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
>>      > > <ma...@asterixdb.apache.org>
>>      > >     Subject: Producing combinations of array elements in SQL++
>>      > >
>>      > >
>>      > >
>>      > >  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on
>>      > > links
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >     Dear SQL++ experts,
>>      > >
>>      > >
>>      > >
>>      > >     I am working on an academic study [1] comparing the expressiveness of
>>      > > various query languages for nested data in a high-energy physics use case.
>>      > > Among other systems, I have implemented the queries of the use case in
>>      > > SQL++ [2]. In general, SQL++ seems to be quite well suited for those
>>      > > queries, but one frequent pattern seems to have only cumbersome
>>      > formulations.
>>      > >
>>      > >
>>      > >
>>      > >     In short, I am missing the functionality to produce (deterministic)
>>      > > element identifiers for array elements. In BigQuery’s SQL dialect [3],
>>      > > this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For
>>      > > now, I am using a construct similar to this: “FROM (SELECT
>>      > > array_field.*, row_number() OVER () AS idx FROM
>>      > > base_table.array_field) AS array_element” (see full query here [4]). I
>>      > > have two issues with this: First, it needs undebatably more characters
>>      > > than the BigQuery version and it is arguably also more cumbersome.
>>      > > Second, I don’t think that it is even correct since the element
>>      > > numbers are not guaranteed to be deterministic (i.e., if I use that
>>      > > construct in several places in the query, the same array element may
>>      > > get different values for row_number()). The documentation on
>>      > > row_number says “If the window order clause is omitted, the return
>>      > > values may be unpredictable.“ However, introducing an order clause not
>>      > > only makes the pattern even more cumbersome (the elements are objects
>>      > > with at least four fields), it also does not guarantee a deterministic
>>      > > order still. The documentation of the Window Order Clause says: “The
>>      > > row_number() function returns a distinct number for each tuple. If
>>      > > tuples are tied, the results may be unpredictable.” (To be fair, if
>>      > > two array elements agree on all fields, I can probably treat them
>>      > > interchangeably in my use case.)
>>      > >
>>      > >
>>      > >
>>      > >     With that introduction, two questions: (1) Can you confirm that there
>>      > > is really no language construct better than what I currently use (such
>>      > > as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is
>>      > > really undeterministic in the way I use it?
>>      > >
>>      > >
>>      > >
>>      > >     I need these element identifiers for producing combinations (in the
>>      > > mathematical sense [5]) of the elements in an array field, but maybe
>>      > > there is a better way. In the simple case of producing all (unique)
>>      > > pairs of elements of a particular array, I use a construct like the following:
>>      > >
>>      > >
>>      > >
>>      > >           FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
>>      > > base_table.array_field) AS array_element1,
>>      > >
>>      > >                (SELECT array_field.*, row_number() OVER () AS idx FROM
>>      > > base_table. array_field) AS array_element2
>>      > >
>>      > >           WHERE
>>      > >
>>      > >             array_element1.idx < array_element2.idx
>>      > >
>>      > >
>>      > >
>>      > >     So a last question: (3) Can you think of a better way to compute
>>      > > combinations?
>>      > >
>>      > >
>>      > >
>>      > >     Thanks a lot in advance and best regards,
>>      > >
>>      > >     Ingo
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >
>>      > >     [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
>>      > > Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for
>>      > > High- Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
>>      > > https://arxiv.org/abs/2104.12615
>>      > >
>>      > >     [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
>>      > > <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
>>      > >
>>      > >     [3] https://cloud.google.com/bigquery/docs/reference/standard-
>>      > > sql/arrays#flattening_arrays
>>      > > <https://cloud.google.com/bigquery/docs/reference/standard-
>>      > > sql/arrays#flattening_arrays>
>>      > >
>>      > >     [4] https://github.com/RumbleDB/iris-hep-benchmark-
>>      > > sqlpp/blob/master/queries/query-5/query.sqlpp
>>      > > <https://github.com/RumbleDB/iris-hep-benchmark-
>>      > > sqlpp/blob/master/queries/query-5/query.sqlpp>
>>      > >
>>      > >     [5] https://en.wikipedia.org/wiki/Combination
>>      > >
>>      > >
>>      > >
>>      > >
>>

RE: Producing combinations of array elements in SQL++

Posted by Müller Ingo <in...@inf.ethz.ch>.

Hi Dmitry,

Thanks a lot for the pointers! (In the snippets there, I also learned about the LET clause. Great feature -- I'll need to review the query implementations again to use it!)

Cheers,
Ingo


> -----Original Message-----
> From: Dmitry Lychagin <dm...@couchbase.com>
> Sent: Wednesday, August 4, 2021 3:22 AM
> To: users@asterixdb.apache.org
> Subject: Re: Producing combinations of array elements in SQL++
> 
> Hi Ingo,
> 
> I think positional vars should probably work fine in your usecase (unnesting array
> fields within an object).
> There are issues with this feature when it's used in other contexts though (e.g.
> position of an object inside a dataset) These are open bugs in JIRA that track
> problems with this feature: [1][2][3]
> 
> [1] https://issues.apache.org/jira/browse/ASTERIXDB-1050
> [2] https://issues.apache.org/jira/browse/ASTERIXDB-1052
> [3] https://issues.apache.org/jira/browse/ASTERIXDB-2595  (this one is relevant
> for array unnesting too)
> 
> Thanks,
> -- Dmitry
> 
> 
> On 8/3/21, 10:20 AM, "Müller  Ingo" <in...@inf.ethz.ch> wrote:
> 
>      EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links
> 
> 
> 
> 
> 
>     Quick update: The "AT posVar" construct seems to produce the correct result
> in some simple test cases and are much cleaner that my previous construct (see
> [6])! In particular, the fact that I don't even need UNNEST makes some queries
> really concise :)
> 
>     Small follow-up question: Is there an issue tracking the status of this feature? I
> couldn't find it in JIRA...
> 
>     Cheers,
>     Ingo
> 
> 
>     [6] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/commit/1023d97
> 
> 
>     > -----Original Message-----
>     > From: Müller Ingo <in...@inf.ethz.ch>
>     > Sent: Tuesday, August 3, 2021 6:03 PM
>     > To: users@asterixdb.apache.org
>     > Subject: RE: Producing combinations of array elements in SQL++
>     >
>     > Dear Mike, dear Dmitry,
>     >
>     > Thanks a lot for the quick and helpful answers! I think what I was looking for
> is
>     > the "AT posVar". We are using similar constructs in several other SQL
>     > dialects/query languages. I will try it out right away, but I am not sure yet
> how
>     > we will deal with such a "beta" feature in our study.
>     >
>     > @Mike: I indeed "just" need all combinations and the position of array
> elements
>     > is just a means to get that -- their order isn't important. (In fact, queries
> typically
>     > compute some metric for each combination and then the combination that
>     > minimizes or maximizes that metric.) If you know a better way to express
> that, I
>     > am very interested!
>     >
>     > All the best,
>     > Ingo
>     >
>     >
>     > > -----Original Message-----
>     > > From: Mike Carey <dt...@gmail.com>
>     > > Sent: Tuesday, August 3, 2021 5:48 PM
>     > > To: users@asterixdb.apache.org
>     > > Subject: Re: Producing combinations of array elements in SQL++
>     > >
>     > > Ingo,
>     > >
>     > > ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array
>     > > has [1, 2, 2, 3] as its contents.  What would the desired result be?
>     > > The best approach to this will depend on what you mean by that.
>     > >
>     > >
>     > > Cheers,
>     > >
>     > > Mike
>     > >
>     > >
>     > > On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
>     > >
>     > >
>     > >     Hi Ingo,
>     > >
>     > >
>     > >
>     > >     1) SQL++ supports positional variables in the FROM clause, as follows:
>     > >
>     > >     FROM dataset AS ds, ds.array_field AS elementVar AT posVar
>     > >
>     > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
>     > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   --
>     > FromTerm)
>     > >
>     > >     or using UNNEST instead of “,”
>     > >
>     > >     FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
>     > >
>     > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
>     > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
>     > > UnnestClause)
>     > >
>     > >
>     > >
>     > >     Positional variable is bound to a position of an element inside an
>     > > array which is being unnested.
>     > >
>     > >
>     > >
>     > >     Here’s how your query could look like using these variables:
>     > >
>     > >     SELECT …
>     > >
>     > >     FROM base_table, base_table.array_field AS  array_element1 AT pos1,
>     > > base_table.array_field AS  array_element2 AT pos2
>     > >
>     > >     WHERE pos1 < pos2
>     > >
>     > >
>     > >
>     > >     2) As for window function calls. I can confirm that if there is no
>     > > ORDER BY sub-clause inside OVER clause then the order of the elements
>     > > processed by window functions (row_number(), rank(), etc) is not
> guaranteed.
>     > >
>     > >
>     > >
>     > >     Thanks,
>     > >
>     > >     -- Dmitry
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >     From: Müller Ingo <in...@inf.ethz.ch>
>     > > <ma...@inf.ethz.ch>
>     > >     Reply-To: "users@asterixdb.apache.org"
>     > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
>     > > <ma...@asterixdb.apache.org>
>     > >     Date: Tuesday, August 3, 2021 at 1:39 AM
>     > >     To: "users@asterixdb.apache.org"
>     > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
>     > > <ma...@asterixdb.apache.org>
>     > >     Subject: Producing combinations of array elements in SQL++
>     > >
>     > >
>     > >
>     > >  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on
>     > > links
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >     Dear SQL++ experts,
>     > >
>     > >
>     > >
>     > >     I am working on an academic study [1] comparing the expressiveness of
>     > > various query languages for nested data in a high-energy physics use case.
>     > > Among other systems, I have implemented the queries of the use case in
>     > > SQL++ [2]. In general, SQL++ seems to be quite well suited for those
>     > > queries, but one frequent pattern seems to have only cumbersome
>     > formulations.
>     > >
>     > >
>     > >
>     > >     In short, I am missing the functionality to produce (deterministic)
>     > > element identifiers for array elements. In BigQuery’s SQL dialect [3],
>     > > this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For
>     > > now, I am using a construct similar to this: “FROM (SELECT
>     > > array_field.*, row_number() OVER () AS idx FROM
>     > > base_table.array_field) AS array_element” (see full query here [4]). I
>     > > have two issues with this: First, it needs undebatably more characters
>     > > than the BigQuery version and it is arguably also more cumbersome.
>     > > Second, I don’t think that it is even correct since the element
>     > > numbers are not guaranteed to be deterministic (i.e., if I use that
>     > > construct in several places in the query, the same array element may
>     > > get different values for row_number()). The documentation on
>     > > row_number says “If the window order clause is omitted, the return
>     > > values may be unpredictable.“ However, introducing an order clause not
>     > > only makes the pattern even more cumbersome (the elements are objects
>     > > with at least four fields), it also does not guarantee a deterministic
>     > > order still. The documentation of the Window Order Clause says: “The
>     > > row_number() function returns a distinct number for each tuple. If
>     > > tuples are tied, the results may be unpredictable.” (To be fair, if
>     > > two array elements agree on all fields, I can probably treat them
>     > > interchangeably in my use case.)
>     > >
>     > >
>     > >
>     > >     With that introduction, two questions: (1) Can you confirm that there
>     > > is really no language construct better than what I currently use (such
>     > > as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is
>     > > really undeterministic in the way I use it?
>     > >
>     > >
>     > >
>     > >     I need these element identifiers for producing combinations (in the
>     > > mathematical sense [5]) of the elements in an array field, but maybe
>     > > there is a better way. In the simple case of producing all (unique)
>     > > pairs of elements of a particular array, I use a construct like the following:
>     > >
>     > >
>     > >
>     > >           FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
>     > > base_table.array_field) AS array_element1,
>     > >
>     > >                (SELECT array_field.*, row_number() OVER () AS idx FROM
>     > > base_table. array_field) AS array_element2
>     > >
>     > >           WHERE
>     > >
>     > >             array_element1.idx < array_element2.idx
>     > >
>     > >
>     > >
>     > >     So a last question: (3) Can you think of a better way to compute
>     > > combinations?
>     > >
>     > >
>     > >
>     > >     Thanks a lot in advance and best regards,
>     > >
>     > >     Ingo
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >     [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
>     > > Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for
>     > > High- Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
>     > > https://arxiv.org/abs/2104.12615
>     > >
>     > >     [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
>     > > <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
>     > >
>     > >     [3] https://cloud.google.com/bigquery/docs/reference/standard-
>     > > sql/arrays#flattening_arrays
>     > > <https://cloud.google.com/bigquery/docs/reference/standard-
>     > > sql/arrays#flattening_arrays>
>     > >
>     > >     [4] https://github.com/RumbleDB/iris-hep-benchmark-
>     > > sqlpp/blob/master/queries/query-5/query.sqlpp
>     > > <https://github.com/RumbleDB/iris-hep-benchmark-
>     > > sqlpp/blob/master/queries/query-5/query.sqlpp>
>     > >
>     > >     [5] https://en.wikipedia.org/wiki/Combination
>     > >
>     > >
>     > >
>     > >
>

Re: Producing combinations of array elements in SQL++

Posted by Dmitry Lychagin <dm...@couchbase.com>.

Hi Ingo, 

I think positional vars should probably work fine in your usecase (unnesting array fields within an object). 
There are issues with this feature when it's used in other contexts though (e.g. position of an object inside a dataset)
These are open bugs in JIRA that track problems with this feature: [1][2][3] 

[1] https://issues.apache.org/jira/browse/ASTERIXDB-1050
[2] https://issues.apache.org/jira/browse/ASTERIXDB-1052
[3] https://issues.apache.org/jira/browse/ASTERIXDB-2595  (this one is relevant for array unnesting too)

Thanks,
-- Dmitry
 

On 8/3/21, 10:20 AM, "Müller  Ingo" <in...@inf.ethz.ch> wrote:

     EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links





    Quick update: The "AT posVar" construct seems to produce the correct result in some simple test cases and are much cleaner that my previous construct (see [6])! In particular, the fact that I don't even need UNNEST makes some queries really concise :)

    Small follow-up question: Is there an issue tracking the status of this feature? I couldn't find it in JIRA...

    Cheers,
    Ingo


    [6] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/commit/1023d97


    > -----Original Message-----
    > From: Müller Ingo <in...@inf.ethz.ch>
    > Sent: Tuesday, August 3, 2021 6:03 PM
    > To: users@asterixdb.apache.org
    > Subject: RE: Producing combinations of array elements in SQL++
    >
    > Dear Mike, dear Dmitry,
    >
    > Thanks a lot for the quick and helpful answers! I think what I was looking for is
    > the "AT posVar". We are using similar constructs in several other SQL
    > dialects/query languages. I will try it out right away, but I am not sure yet how
    > we will deal with such a "beta" feature in our study.
    >
    > @Mike: I indeed "just" need all combinations and the position of array elements
    > is just a means to get that -- their order isn't important. (In fact, queries typically
    > compute some metric for each combination and then the combination that
    > minimizes or maximizes that metric.) If you know a better way to express that, I
    > am very interested!
    >
    > All the best,
    > Ingo
    >
    >
    > > -----Original Message-----
    > > From: Mike Carey <dt...@gmail.com>
    > > Sent: Tuesday, August 3, 2021 5:48 PM
    > > To: users@asterixdb.apache.org
    > > Subject: Re: Producing combinations of array elements in SQL++
    > >
    > > Ingo,
    > >
    > > ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array
    > > has [1, 2, 2, 3] as its contents.  What would the desired result be?
    > > The best approach to this will depend on what you mean by that.
    > >
    > >
    > > Cheers,
    > >
    > > Mike
    > >
    > >
    > > On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
    > >
    > >
    > >     Hi Ingo,
    > >
    > >
    > >
    > >     1) SQL++ supports positional variables in the FROM clause, as follows:
    > >
    > >     FROM dataset AS ds, ds.array_field AS elementVar AT posVar
    > >
    > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
    > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   --
    > FromTerm)
    > >
    > >     or using UNNEST instead of “,”
    > >
    > >     FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
    > >
    > >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
    > > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
    > > UnnestClause)
    > >
    > >
    > >
    > >     Positional variable is bound to a position of an element inside an
    > > array which is being unnested.
    > >
    > >
    > >
    > >     Here’s how your query could look like using these variables:
    > >
    > >     SELECT …
    > >
    > >     FROM base_table, base_table.array_field AS  array_element1 AT pos1,
    > > base_table.array_field AS  array_element2 AT pos2
    > >
    > >     WHERE pos1 < pos2
    > >
    > >
    > >
    > >     2) As for window function calls. I can confirm that if there is no
    > > ORDER BY sub-clause inside OVER clause then the order of the elements
    > > processed by window functions (row_number(), rank(), etc) is not guaranteed.
    > >
    > >
    > >
    > >     Thanks,
    > >
    > >     -- Dmitry
    > >
    > >
    > >
    > >
    > >
    > >     From: Müller Ingo <in...@inf.ethz.ch>
    > > <ma...@inf.ethz.ch>
    > >     Reply-To: "users@asterixdb.apache.org"
    > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
    > > <ma...@asterixdb.apache.org>
    > >     Date: Tuesday, August 3, 2021 at 1:39 AM
    > >     To: "users@asterixdb.apache.org"
    > > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
    > > <ma...@asterixdb.apache.org>
    > >     Subject: Producing combinations of array elements in SQL++
    > >
    > >
    > >
    > >  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on
    > > links
    > >
    > >
    > >
    > >
    > >
    > >     Dear SQL++ experts,
    > >
    > >
    > >
    > >     I am working on an academic study [1] comparing the expressiveness of
    > > various query languages for nested data in a high-energy physics use case.
    > > Among other systems, I have implemented the queries of the use case in
    > > SQL++ [2]. In general, SQL++ seems to be quite well suited for those
    > > queries, but one frequent pattern seems to have only cumbersome
    > formulations.
    > >
    > >
    > >
    > >     In short, I am missing the functionality to produce (deterministic)
    > > element identifiers for array elements. In BigQuery’s SQL dialect [3],
    > > this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For
    > > now, I am using a construct similar to this: “FROM (SELECT
    > > array_field.*, row_number() OVER () AS idx FROM
    > > base_table.array_field) AS array_element” (see full query here [4]). I
    > > have two issues with this: First, it needs undebatably more characters
    > > than the BigQuery version and it is arguably also more cumbersome.
    > > Second, I don’t think that it is even correct since the element
    > > numbers are not guaranteed to be deterministic (i.e., if I use that
    > > construct in several places in the query, the same array element may
    > > get different values for row_number()). The documentation on
    > > row_number says “If the window order clause is omitted, the return
    > > values may be unpredictable.“ However, introducing an order clause not
    > > only makes the pattern even more cumbersome (the elements are objects
    > > with at least four fields), it also does not guarantee a deterministic
    > > order still. The documentation of the Window Order Clause says: “The
    > > row_number() function returns a distinct number for each tuple. If
    > > tuples are tied, the results may be unpredictable.” (To be fair, if
    > > two array elements agree on all fields, I can probably treat them
    > > interchangeably in my use case.)
    > >
    > >
    > >
    > >     With that introduction, two questions: (1) Can you confirm that there
    > > is really no language construct better than what I currently use (such
    > > as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is
    > > really undeterministic in the way I use it?
    > >
    > >
    > >
    > >     I need these element identifiers for producing combinations (in the
    > > mathematical sense [5]) of the elements in an array field, but maybe
    > > there is a better way. In the simple case of producing all (unique)
    > > pairs of elements of a particular array, I use a construct like the following:
    > >
    > >
    > >
    > >           FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
    > > base_table.array_field) AS array_element1,
    > >
    > >                (SELECT array_field.*, row_number() OVER () AS idx FROM
    > > base_table. array_field) AS array_element2
    > >
    > >           WHERE
    > >
    > >             array_element1.idx < array_element2.idx
    > >
    > >
    > >
    > >     So a last question: (3) Can you think of a better way to compute
    > > combinations?
    > >
    > >
    > >
    > >     Thanks a lot in advance and best regards,
    > >
    > >     Ingo
    > >
    > >
    > >
    > >
    > >
    > >     [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
    > > Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for
    > > High- Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
    > > https://arxiv.org/abs/2104.12615
    > >
    > >     [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
    > > <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
    > >
    > >     [3] https://cloud.google.com/bigquery/docs/reference/standard-
    > > sql/arrays#flattening_arrays
    > > <https://cloud.google.com/bigquery/docs/reference/standard-
    > > sql/arrays#flattening_arrays>
    > >
    > >     [4] https://github.com/RumbleDB/iris-hep-benchmark-
    > > sqlpp/blob/master/queries/query-5/query.sqlpp
    > > <https://github.com/RumbleDB/iris-hep-benchmark-
    > > sqlpp/blob/master/queries/query-5/query.sqlpp>
    > >
    > >     [5] https://en.wikipedia.org/wiki/Combination
    > >
    > >
    > >
    > >

RE: Producing combinations of array elements in SQL++

Posted by Müller Ingo <in...@inf.ethz.ch>.

Quick update: The "AT posVar" construct seems to produce the correct result in some simple test cases and are much cleaner that my previous construct (see [6])! In particular, the fact that I don't even need UNNEST makes some queries really concise :)

Small follow-up question: Is there an issue tracking the status of this feature? I couldn't find it in JIRA...

Cheers,
Ingo


[6] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/commit/1023d97


> -----Original Message-----
> From: Müller Ingo <in...@inf.ethz.ch>
> Sent: Tuesday, August 3, 2021 6:03 PM
> To: users@asterixdb.apache.org
> Subject: RE: Producing combinations of array elements in SQL++
> 
> Dear Mike, dear Dmitry,
> 
> Thanks a lot for the quick and helpful answers! I think what I was looking for is
> the "AT posVar". We are using similar constructs in several other SQL
> dialects/query languages. I will try it out right away, but I am not sure yet how
> we will deal with such a "beta" feature in our study.
> 
> @Mike: I indeed "just" need all combinations and the position of array elements
> is just a means to get that -- their order isn't important. (In fact, queries typically
> compute some metric for each combination and then the combination that
> minimizes or maximizes that metric.) If you know a better way to express that, I
> am very interested!
> 
> All the best,
> Ingo
> 
> 
> > -----Original Message-----
> > From: Mike Carey <dt...@gmail.com>
> > Sent: Tuesday, August 3, 2021 5:48 PM
> > To: users@asterixdb.apache.org
> > Subject: Re: Producing combinations of array elements in SQL++
> >
> > Ingo,
> >
> > ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array
> > has [1, 2, 2, 3] as its contents.  What would the desired result be?
> > The best approach to this will depend on what you mean by that.
> >
> >
> > Cheers,
> >
> > Mike
> >
> >
> > On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
> >
> >
> > 	Hi Ingo,
> >
> >
> >
> > 	1) SQL++ supports positional variables in the FROM clause, as follows:
> >
> > 	FROM dataset AS ds, ds.array_field AS elementVar AT posVar
> >
> > 	(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
> > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   --
> FromTerm)
> >
> > 	or using UNNEST instead of “,”
> >
> > 	FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
> >
> > 	(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
> > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
> > UnnestClause)
> >
> >
> >
> > 	Positional variable is bound to a position of an element inside an
> > array which is being unnested.
> >
> >
> >
> > 	Here’s how your query could look like using these variables:
> >
> > 	SELECT …
> >
> > 	FROM base_table, base_table.array_field AS  array_element1 AT pos1,
> > base_table.array_field AS  array_element2 AT pos2
> >
> > 	WHERE pos1 < pos2
> >
> >
> >
> > 	2) As for window function calls. I can confirm that if there is no
> > ORDER BY sub-clause inside OVER clause then the order of the elements
> > processed by window functions (row_number(), rank(), etc) is not guaranteed.
> >
> >
> >
> > 	Thanks,
> >
> > 	-- Dmitry
> >
> >
> >
> >
> >
> > 	From: Müller Ingo <in...@inf.ethz.ch>
> > <ma...@inf.ethz.ch>
> > 	Reply-To: "users@asterixdb.apache.org"
> > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
> > <ma...@asterixdb.apache.org>
> > 	Date: Tuesday, August 3, 2021 at 1:39 AM
> > 	To: "users@asterixdb.apache.org"
> > <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
> > <ma...@asterixdb.apache.org>
> > 	Subject: Producing combinations of array elements in SQL++
> >
> >
> >
> >  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on
> > links
> >
> >
> >
> >
> >
> > 	Dear SQL++ experts,
> >
> >
> >
> > 	I am working on an academic study [1] comparing the expressiveness of
> > various query languages for nested data in a high-energy physics use case.
> > Among other systems, I have implemented the queries of the use case in
> > SQL++ [2]. In general, SQL++ seems to be quite well suited for those
> > queries, but one frequent pattern seems to have only cumbersome
> formulations.
> >
> >
> >
> > 	In short, I am missing the functionality to produce (deterministic)
> > element identifiers for array elements. In BigQuery’s SQL dialect [3],
> > this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For
> > now, I am using a construct similar to this: “FROM (SELECT
> > array_field.*, row_number() OVER () AS idx FROM
> > base_table.array_field) AS array_element” (see full query here [4]). I
> > have two issues with this: First, it needs undebatably more characters
> > than the BigQuery version and it is arguably also more cumbersome.
> > Second, I don’t think that it is even correct since the element
> > numbers are not guaranteed to be deterministic (i.e., if I use that
> > construct in several places in the query, the same array element may
> > get different values for row_number()). The documentation on
> > row_number says “If the window order clause is omitted, the return
> > values may be unpredictable.“ However, introducing an order clause not
> > only makes the pattern even more cumbersome (the elements are objects
> > with at least four fields), it also does not guarantee a deterministic
> > order still. The documentation of the Window Order Clause says: “The
> > row_number() function returns a distinct number for each tuple. If
> > tuples are tied, the results may be unpredictable.” (To be fair, if
> > two array elements agree on all fields, I can probably treat them
> > interchangeably in my use case.)
> >
> >
> >
> > 	With that introduction, two questions: (1) Can you confirm that there
> > is really no language construct better than what I currently use (such
> > as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is
> > really undeterministic in the way I use it?
> >
> >
> >
> > 	I need these element identifiers for producing combinations (in the
> > mathematical sense [5]) of the elements in an array field, but maybe
> > there is a better way. In the simple case of producing all (unique)
> > pairs of elements of a particular array, I use a construct like the following:
> >
> >
> >
> > 	      FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
> > base_table.array_field) AS array_element1,
> >
> > 	           (SELECT array_field.*, row_number() OVER () AS idx FROM
> > base_table. array_field) AS array_element2
> >
> > 	      WHERE
> >
> > 	        array_element1.idx < array_element2.idx
> >
> >
> >
> > 	So a last question: (3) Can you think of a better way to compute
> > combinations?
> >
> >
> >
> > 	Thanks a lot in advance and best regards,
> >
> > 	Ingo
> >
> >
> >
> >
> >
> > 	[1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
> > Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for
> > High- Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
> > https://arxiv.org/abs/2104.12615
> >
> > 	[2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
> > <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
> >
> > 	[3] https://cloud.google.com/bigquery/docs/reference/standard-
> > sql/arrays#flattening_arrays
> > <https://cloud.google.com/bigquery/docs/reference/standard-
> > sql/arrays#flattening_arrays>
> >
> > 	[4] https://github.com/RumbleDB/iris-hep-benchmark-
> > sqlpp/blob/master/queries/query-5/query.sqlpp
> > <https://github.com/RumbleDB/iris-hep-benchmark-
> > sqlpp/blob/master/queries/query-5/query.sqlpp>
> >
> > 	[5] https://en.wikipedia.org/wiki/Combination
> >
> >
> >
> >

RE: Producing combinations of array elements in SQL++

Posted by Müller Ingo <in...@inf.ethz.ch>.

Dear Mike, dear Dmitry,

Thanks a lot for the quick and helpful answers! I think what I was looking for is the "AT posVar". We are using similar constructs in several other SQL dialects/query languages. I will try it out right away, but I am not sure yet how we will deal with such a "beta" feature in our study.

@Mike: I indeed "just" need all combinations and the position of array elements is just a means to get that -- their order isn't important. (In fact, queries typically compute some metric for each combination and then the combination that minimizes or maximizes that metric.) If you know a better way to express that, I am very interested!

All the best,
Ingo


> -----Original Message-----
> From: Mike Carey <dt...@gmail.com>
> Sent: Tuesday, August 3, 2021 5:48 PM
> To: users@asterixdb.apache.org
> Subject: Re: Producing combinations of array elements in SQL++
> 
> Ingo,
> 
> ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array has [1, 2,
> 2, 3] as its contents.  What would the desired result be?  The best approach to
> this will depend on what you mean by that.
> 
> 
> Cheers,
> 
> Mike
> 
> 
> On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
> 
> 
> 	Hi Ingo,
> 
> 
> 
> 	1) SQL++ supports positional variables in the FROM clause, as follows:
> 
> 	FROM dataset AS ds, ds.array_field AS elementVar AT posVar
> 
> 	(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   -- FromTerm)
> 
> 	or using UNNEST instead of “,”
> 
> 	FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
> 
> 	(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
> UnnestClause)
> 
> 
> 
> 	Positional variable is bound to a position of an element inside an array
> which is being unnested.
> 
> 
> 
> 	Here’s how your query could look like using these variables:
> 
> 	SELECT …
> 
> 	FROM base_table, base_table.array_field AS  array_element1 AT pos1,
> base_table.array_field AS  array_element2 AT pos2
> 
> 	WHERE pos1 < pos2
> 
> 
> 
> 	2) As for window function calls. I can confirm that if there is no ORDER
> BY sub-clause inside OVER clause then the order of the elements processed by
> window functions (row_number(), rank(), etc) is not guaranteed.
> 
> 
> 
> 	Thanks,
> 
> 	-- Dmitry
> 
> 
> 
> 
> 
> 	From: Müller Ingo <in...@inf.ethz.ch>
> <ma...@inf.ethz.ch>
> 	Reply-To: "users@asterixdb.apache.org"
> <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
> <ma...@asterixdb.apache.org>
> 	Date: Tuesday, August 3, 2021 at 1:39 AM
> 	To: "users@asterixdb.apache.org"
> <ma...@asterixdb.apache.org>  <us...@asterixdb.apache.org>
> <ma...@asterixdb.apache.org>
> 	Subject: Producing combinations of array elements in SQL++
> 
> 
> 
>  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links
> 
> 
> 
> 
> 
> 	Dear SQL++ experts,
> 
> 
> 
> 	I am working on an academic study [1] comparing the expressiveness of
> various query languages for nested data in a high-energy physics use case.
> Among other systems, I have implemented the queries of the use case in SQL++
> [2]. In general, SQL++ seems to be quite well suited for those queries, but one
> frequent pattern seems to have only cumbersome formulations.
> 
> 
> 
> 	In short, I am missing the functionality to produce (deterministic)
> element identifiers for array elements. In BigQuery’s SQL dialect [3], this can be
> done with “UNNEST(array_field) WITH OFFSET AS idx”. For now, I am using a
> construct similar to this: “FROM (SELECT array_field.*, row_number() OVER () AS
> idx FROM base_table.array_field) AS array_element” (see full query here [4]). I
> have two issues with this: First, it needs undebatably more characters than the
> BigQuery version and it is arguably also more cumbersome. Second, I don’t think
> that it is even correct since the element numbers are not guaranteed to be
> deterministic (i.e., if I use that construct in several places in the query, the same
> array element may get different values for row_number()). The documentation
> on row_number says “If the window order clause is omitted, the return values
> may be unpredictable.“ However, introducing an order clause not only makes
> the pattern even more cumbersome (the elements are objects with at least four
> fields), it also does not guarantee a deterministic order still. The documentation
> of the Window Order Clause says: “The row_number() function returns a distinct
> number for each tuple. If tuples are tied, the results may be unpredictable.” (To
> be fair, if two array elements agree on all fields, I can probably treat them
> interchangeably in my use case.)
> 
> 
> 
> 	With that introduction, two questions: (1) Can you confirm that there is
> really no language construct better than what I currently use (such as BigQuery’s
> “OFFSET”)? (2) Can you confirm that row_number() is really undeterministic in
> the way I use it?
> 
> 
> 
> 	I need these element identifiers for producing combinations (in the
> mathematical sense [5]) of the elements in an array field, but maybe there is a
> better way. In the simple case of producing all (unique) pairs of elements of a
> particular array, I use a construct like the following:
> 
> 
> 
> 	      FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
> base_table.array_field) AS array_element1,
> 
> 	           (SELECT array_field.*, row_number() OVER () AS idx FROM
> base_table. array_field) AS array_element2
> 
> 	      WHERE
> 
> 	        array_element1.idx < array_element2.idx
> 
> 
> 
> 	So a last question: (3) Can you think of a better way to compute
> combinations?
> 
> 
> 
> 	Thanks a lot in advance and best regards,
> 
> 	Ingo
> 
> 
> 
> 
> 
> 	[1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
> Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for High-
> Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
> https://arxiv.org/abs/2104.12615
> 
> 	[2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
> 
> 	[3] https://cloud.google.com/bigquery/docs/reference/standard-
> sql/arrays#flattening_arrays
> <https://cloud.google.com/bigquery/docs/reference/standard-
> sql/arrays#flattening_arrays>
> 
> 	[4] https://github.com/RumbleDB/iris-hep-benchmark-
> sqlpp/blob/master/queries/query-5/query.sqlpp
> <https://github.com/RumbleDB/iris-hep-benchmark-
> sqlpp/blob/master/queries/query-5/query.sqlpp>
> 
> 	[5] https://en.wikipedia.org/wiki/Combination
> 
> 
> 
>

Re: Producing combinations of array elements in SQL++

Posted by Mike Carey <dt...@gmail.com>.

Ingo,

ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array has 
[1, 2, 2, 3] as its contents. What would the desired result be?  The 
best approach to this will depend on what you mean by that.

Cheers,

Mike

On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
>
> Hi Ingo,
>
> 1) SQL++ supports positional variables in the FROM clause, as follows:
>
> FROM dataset AS ds, ds.array_field AS elementVar AT posVar
>
> (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178 
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>  -- FromTerm)
>
> or using UNNEST instead of “,”
>
> FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
>
> (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181 
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181> – 
> UnnestClause)
>
> Positional variable is bound to a position of an element inside an 
> array which is being unnested.
>
> Here’s how your query could look like using these variables:
>
> SELECT …
>
> FROM base_table, base_table.array_field AS  array_element1 AT pos1,  
> base_table.array_field AS  array_element2 AT pos2
>
> WHERE pos1 < pos2
>
> 2) As for window function calls. I can confirm that if there is no 
> ORDER BY sub-clause inside OVER clause then the order of the elements 
> processed by window functions (row_number(), rank(), etc) is not 
> guaranteed.
>
> Thanks,
>
> -- Dmitry
>
> *From: *Müller Ingo <in...@inf.ethz.ch>
> *Reply-To: *"users@asterixdb.apache.org" <us...@asterixdb.apache.org>
> *Date: *Tuesday, August 3, 2021 at 1:39 AM
> *To: *"users@asterixdb.apache.org" <us...@asterixdb.apache.org>
> *Subject: *Producing combinations of array elements in SQL++
>
> *EXTERNAL EMAIL**: *Use caution when opening attachments or clicking 
> on links
>
> Dear SQL++ experts,
>
> I am working on an academic study [1] comparing the expressiveness of 
> various query languages for nested data in a high-energy physics use 
> case. Among other systems, I have implemented the queries of the use 
> case in SQL++ [2]. In general, SQL++ seems to be quite well suited for 
> those queries, but one frequent pattern seems to have only cumbersome 
> formulations.
>
> In short, I am missing the functionality to produce (deterministic) 
> element identifiers for array elements. In BigQuery’s SQL dialect [3], 
> this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For 
> now, I am using a construct similar to this: “FROM (SELECT 
> array_field.*, row_number() OVER () AS idx FROM 
> base_table.array_field) AS array_element” (see full query here [4]). I 
> have two issues with this: First, it needs undebatably more characters 
> than the BigQuery version and it is arguably also more cumbersome. 
> Second, I don’t think that it is even correct since the element 
> numbers are not guaranteed to be deterministic (i.e., if I use that 
> construct in several places in the query, the same array element may 
> get different values for row_number()). The documentation on 
> row_number says “If the window order clause is omitted, the return 
> values may be unpredictable.“ However, introducing an order clause not 
> only makes the pattern even more cumbersome (the elements are objects 
> with at least four fields), it also does not guarantee a deterministic 
> order still. The documentation of the Window Order Clause says: “The 
> row_number() function returns a distinct number for each tuple. If 
> tuples are tied, the results may be unpredictable.” (To be fair, if 
> two array elements agree on all fields, I can probably treat them 
> interchangeably in my use case.)
>
> With that introduction, two questions: (1) Can you confirm that there 
> is really no language construct better than what I currently use (such 
> as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is 
> really undeterministic in the way I use it?
>
> I need these element identifiers for producing combinations (in the 
> mathematical sense [5]) of the elements in an array field, but maybe 
> there is a better way. In the simple case of producing all (unique) 
> pairs of elements of a particular array, I use a construct like the 
> following:
>
>       FROM (SELECT array_field.*, row_number() OVER () AS idx FROM 
> base_table.array_field) AS array_element1,
>
>            (SELECT array_field.*, row_number() OVER () AS idx FROM 
> base_table. array_field) AS array_element2
>
>       WHERE
>
>         array_element1.idx < array_element2.idx
>
> So a last question: (3) Can you think of a better way to compute 
> combinations?
>
> Thanks a lot in advance and best regards,
>
> Ingo
>
> [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason 
> Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for 
> High-Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021. 
> https://arxiv.org/abs/2104.12615 <https://arxiv.org/abs/2104.12615>
>
> [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp 
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
>
> [3] 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays 
> <https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays>
>
> [4] 
> https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp 
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp>
>
> [5] https://en.wikipedia.org/wiki/Combination 
> <https://en.wikipedia.org/wiki/Combination>
>

Re: Producing combinations of array elements in SQL++

Posted by Dmitry Lychagin <dm...@couchbase.com>.

Hi Ingo,

1) SQL++ supports positional variables in the FROM clause, as follows:
FROM dataset AS ds, ds.array_field AS elementVar AT posVar
(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178  -- FromTerm)
or using UNNEST instead of “,”
FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
(see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181 – UnnestClause)

Positional variable is bound to a position of an element inside an array which is being unnested.

Here’s how your query could look like using these variables:
SELECT …
FROM base_table, base_table.array_field AS  array_element1 AT pos1,  base_table.array_field AS  array_element2 AT pos2
WHERE pos1 < pos2

2) As for window function calls. I can confirm that if there is no ORDER BY sub-clause inside OVER clause then the order of the elements processed by window functions (row_number(), rank(), etc) is not guaranteed.

Thanks,
-- Dmitry


From: Müller Ingo <in...@inf.ethz.ch>
Reply-To: "users@asterixdb.apache.org" <us...@asterixdb.apache.org>
Date: Tuesday, August 3, 2021 at 1:39 AM
To: "users@asterixdb.apache.org" <us...@asterixdb.apache.org>
Subject: Producing combinations of array elements in SQL++

 EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links




Dear SQL++ experts,

I am working on an academic study [1] comparing the expressiveness of various query languages for nested data in a high-energy physics use case. Among other systems, I have implemented the queries of the use case in SQL++ [2]. In general, SQL++ seems to be quite well suited for those queries, but one frequent pattern seems to have only cumbersome formulations.

In short, I am missing the functionality to produce (deterministic) element identifiers for array elements. In BigQuery’s SQL dialect [3], this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For now, I am using a construct similar to this: “FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element” (see full query here [4]). I have two issues with this: First, it needs undebatably more characters than the BigQuery version and it is arguably also more cumbersome. Second, I don’t think that it is even correct since the element numbers are not guaranteed to be deterministic (i.e., if I use that construct in several places in the query, the same array element may get different values for row_number()). The documentation on row_number says “If the window order clause is omitted, the return values may be unpredictable.“ However, introducing an order clause not only makes the pattern even more cumbersome (the elements are objects with at least four fields), it also does not guarantee a deterministic order still. The documentation of the Window Order Clause says: “The row_number() function returns a distinct number for each tuple. If tuples are tied, the results may be unpredictable.” (To be fair, if two array elements agree on all fields, I can probably treat them interchangeably in my use case.)

With that introduction, two questions: (1) Can you confirm that there is really no language construct better than what I currently use (such as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is really undeterministic in the way I use it?

I need these element identifiers for producing combinations (in the mathematical sense [5]) of the elements in an array field, but maybe there is a better way. In the simple case of producing all (unique) pairs of elements of a particular array, I use a construct like the following:

      FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element1,
           (SELECT array_field.*, row_number() OVER () AS idx FROM base_table. array_field) AS array_element2
      WHERE
        array_element1.idx < array_element2.idx

So a last question: (3) Can you think of a better way to compute combinations?

Thanks a lot in advance and best regards,
Ingo


[1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for High-Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021. https://arxiv.org/abs/2104.12615
[2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
[4] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp
[5] https://en.wikipedia.org/wiki/Combination

Re: Producing combinations of array elements in SQL++

Posted by Mike Carey <dt...@gmail.com>.

Ingo,

Very interesting!

QUESTION:  Is order important in your use case, or do you "just" need 
all unique pairs of elements from the array (in nor particular order).  
If order is just a means to that end in your use case, yes, there are 
easier/lower-cost ways.  :-)

BTW, SQL++ has an unpublished AT extension to the FROM clause that is 
very much like what BigQuery has - inherited from its AQL (XQuery-based) 
roots - but it is buggy in the parallel case (clusters) so it's not 
advertised at the moment. There is a fix on a side branch that a UC 
Riverside student has, but it's not checked in or well-tested yet.  
(What's there might actually work for your limited case though - hmm - 
but if order isn't needed, there are better ways.)

Cheers,

Mike

On 8/3/21 1:39 AM, Müller Ingo wrote:
>
> Dear SQL++ experts,
>
> I am working on an academic study [1] comparing the expressiveness of 
> various query languages for nested data in a high-energy physics use 
> case. Among other systems, I have implemented the queries of the use 
> case in SQL++ [2]. In general, SQL++ seems to be quite well suited for 
> those queries, but one frequent pattern seems to have only cumbersome 
> formulations.
>
> In short, I am missing the functionality to produce (deterministic) 
> element identifiers for array elements. In BigQuery’s SQL dialect [3], 
> this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For 
> now, I am using a construct similar to this: “FROM (SELECT 
> array_field.*, row_number() OVER () AS idx FROM 
> base_table.array_field) AS array_element” (see full query here [4]). I 
> have two issues with this: First, it needs undebatably more characters 
> than the BigQuery version and it is arguably also more cumbersome. 
> Second, I don’t think that it is even correct since the element 
> numbers are not guaranteed to be deterministic (i.e., if I use that 
> construct in several places in the query, the same array element may 
> get different values for row_number()). The documentation on 
> row_number says “If the window order clause is omitted, the return 
> values may be unpredictable.“ However, introducing an order clause not 
> only makes the pattern even more cumbersome (the elements are objects 
> with at least four fields), it also does not guarantee a deterministic 
> order still. The documentation of the Window Order Clause says: “The 
> row_number() function returns a distinct number for each tuple. If 
> tuples are tied, the results may be unpredictable.” (To be fair, if 
> two array elements agree on all fields, I can probably treat them 
> interchangeably in my use case.)
>
> With that introduction, two questions: (1) Can you confirm that there 
> is really no language construct better than what I currently use (such 
> as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is 
> really undeterministic in the way I use it?
>
> I need these element identifiers for producing combinations (in the 
> mathematical sense [5]) of the elements in an array field, but maybe 
> there is a better way. In the simple case of producing all (unique) 
> pairs of elements of a particular array, I use a construct like the 
> following:
>
>       FROM (SELECT array_field.*, row_number() OVER () AS idx FROM 
> base_table.array_field) AS array_element1,
>
>            (SELECT array_field.*, row_number() OVER () AS idx FROM 
> base_table. array_field) AS array_element2
>
>       WHERE
>
>         array_element1.idx < array_element2.idx
>
> So a last question: (3) Can you think of a better way to compute 
> combinations?
>
> Thanks a lot in advance and best regards,
>
> Ingo
>
> [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason 
> Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for 
> High-Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021. 
> https://arxiv.org/abs/2104.12615 <https://arxiv.org/abs/2104.12615>
>
> [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp 
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
>
> [3] 
> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays 
> <https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays>
>
> [4] 
> https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp 
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp>
>
> [5] https://en.wikipedia.org/wiki/Combination 
> <https://en.wikipedia.org/wiki/Combination>
>