You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "Glenn Justo Galvizo (Jira)" <ji...@apache.org> on 2021/10/06 22:25:00 UTC

[jira] [Created] (ASTERIXDB-2971) UNION ALL query giving wrong result

Glenn Justo Galvizo created ASTERIXDB-2971:
----------------------------------------------

             Summary: UNION ALL query giving wrong result
                 Key: ASTERIXDB-2971
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2971
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: COMP - Compiler
            Reporter: Glenn Justo Galvizo
         Attachments: non-union.plan, union-all.plan

Using the DDLs:

 
{code:java}
DROP DATAVERSE Yelp IF EXISTS;
CREATE DATAVERSE Yelp;
USE Yelp;
CREATE TYPE GenericType
AS { _id: uuid };
CREATE DATASET Businesses (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Checkins (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Friends (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Reviews (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Tips (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Users (GenericType)
PRIMARY KEY _id AUTOGENERATED;
CREATE INDEX businessesPrimaryIdx
ON Businesses ( business_id : string );
CREATE INDEX checkinBusinessIDIdx
ON Checkins ( business_id : string );
CREATE INDEX friendsUserIDIdx
ON Friends ( user_id : string );
CREATE INDEX reviewsPrimaryIdx
ON Reviews ( review_id : string );
CREATE INDEX tipsUserIDIdx
ON Tips ( user_id : string );
CREATE INDEX usersPrimaryIdx
ON Users ( user_id : string );
{code}
 

The following query returns the correct count results:

 
{code:java}
WITH U1 AS (
 FROM (FROM Yelp.Reviews R SELECT VALUE R LIMIT 100000) R,
 Yelp.Users U1, Yelp.Tips T, Yelp.Checkins C
 WHERE R.user_id = U1.user_id AND
 R.business_id = C.business_id AND
 R.user_id = T.user_id AND
 LEN(U1.friends) BETWEEN 10 AND 30
 GROUP BY U1.user_id
 LET reviewCount = COUNT(R._id),
 tipCount = COUNT(T._id)
 HAVING reviewCount > 10 AND
 tipCount > 10
 SELECT VALUE U1.user_id
 LIMIT 15
 ),
 U1F AS (
 FROM U1, Yelp.Users UU1, UU1.friends U1F
 WHERE TO_STRING(U1) /* +indexnl */ = UU1.user_id
 SELECT DISTINCT VALUE U1F
 )
SELECT VALUE {
 "u1Count": ( FROM U1 SELECT VALUE COUNT(*) )[0],
 "u1FCount": ( FROM U1F SELECT VALUE COUNT(*) )[0]
}{code}
 

But if I were UNION ALL these values instead (have two documents instead of one document with two fields), I get a different count for "u1FCount".

 
{code:java}
WITH U1 AS (
 -- Better to use hash join here...
 FROM (FROM Yelp.Reviews R SELECT VALUE R LIMIT 100000) R,
 Yelp.Users U1, Yelp.Tips T, Yelp.Checkins C
 WHERE R.user_id = U1.user_id AND
 R.business_id = C.business_id AND
 R.user_id = T.user_id AND
 LEN(U1.friends) BETWEEN 10 AND 30
 GROUP BY U1.user_id
 LET reviewCount = COUNT(R._id),
 tipCount = COUNT(T._id)
 HAVING reviewCount > 10 AND
 tipCount > 10
 SELECT VALUE U1.user_id
 LIMIT 15
 ),
 U1F AS (
 FROM U1, Yelp.Users UU1, UU1.friends U1F
 WHERE TO_STRING(U1) /* +indexnl */ = UU1.user_id
 SELECT DISTINCT VALUE U1F
 )
SELECT VALUE {
 "unionCounts": ( FROM U1 SELECT COUNT(*) AS u1Count UNION ALL FROM U1F SELECT COUNT(*) AS u1FCount )
}{code}
 

The generated plans for each is attached.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)