You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2020/01/13 11:54:00 UTC

[jira] [Created] (DRILL-7524) Distinct on array with any_value

benj created DRILL-7524:
---------------------------

             Summary: Distinct on array with any_value
                 Key: DRILL-7524
                 URL: https://issues.apache.org/jira/browse/DRILL-7524
             Project: Apache Drill
          Issue Type: Bug
          Components: Functions - Drill
    Affects Versions: 1.17.0
            Reporter: benj
         Attachments: IndexOutOfBoundsException.txt, NegativeArraySizeException.txt

AS drill doesn't allow to GROUP BY nor DISTINCT nor ORDER BY complex type, it may appears as a solution to use any_value aggregate function to do some works.

But some problems appears:

With a dataset of 223664 rows like:
{code:sql}
SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
+-----------------------------------------+--------+
|                   Url                   |  Tags  |
+-----------------------------------------+--------+
| http://000.dijiushipindian.com/feed.rss | ["us"] |
+-----------------------------------------+--------+
{code}

With the own UDF function to_string that only do 

{code:java}
@Param FieldReader input;
...
String rowString = input.readObject().toString();
...
{code}


{code:sql}
SELECT any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORK WELL */;
+--------+
|  Tags  |
+--------+
| ["us"] |
| ["cn"] |
...

SELECT Url, any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
  java.lang.NegativeArraySizeException
{code}
Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608))

And before producing the error, the output show some results like below
{code}
+----------------------------------------------------------------------------------+------+
|                                       Url                                        | Tags |
+----------------------------------------------------------------------------------+------+
| http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html              | []   |
| http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
{code}
And this result is not correct because field Tags is empty although this is never the case in the source file.

So maybe there is a problem with the aggregate function any_value.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)