You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2020/01/13 11:54:00 UTC
[jira] [Created] (DRILL-7524) Distinct on array with any_value
benj created DRILL-7524:
---------------------------
Summary: Distinct on array with any_value
Key: DRILL-7524
URL: https://issues.apache.org/jira/browse/DRILL-7524
Project: Apache Drill
Issue Type: Bug
Components: Functions - Drill
Affects Versions: 1.17.0
Reporter: benj
Attachments: IndexOutOfBoundsException.txt, NegativeArraySizeException.txt
AS drill doesn't allow to GROUP BY nor DISTINCT nor ORDER BY complex type, it may appears as a solution to use any_value aggregate function to do some works.
But some problems appears:
With a dataset of 223664 rows like:
{code:sql}
SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
+-----------------------------------------+--------+
| Url | Tags |
+-----------------------------------------+--------+
| http://000.dijiushipindian.com/feed.rss | ["us"] |
+-----------------------------------------+--------+
{code}
With the own UDF function to_string that only do
{code:java}
@Param FieldReader input;
...
String rowString = input.readObject().toString();
...
{code}
{code:sql}
SELECT any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORK WELL */;
+--------+
| Tags |
+--------+
| ["us"] |
| ["cn"] |
...
SELECT Url, any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
java.lang.NegativeArraySizeException
{code}
Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608))
And before producing the error, the output show some results like below
{code}
+----------------------------------------------------------------------------------+------+
| Url | Tags |
+----------------------------------------------------------------------------------+------+
| http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html | [] |
| http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | [] |
{code}
And this result is not correct because field Tags is empty although this is never the case in the source file.
So maybe there is a problem with the aggregate function any_value.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)