Posted to notifications@asterixdb.apache.org by "Taewoo Kim (JIRA)" <ji...@apache.org> on 2016/12/08 01:42:58 UTC

[jira] [Created] (ASTERIXDB-1748) More enhanced Data Skew Handling during a join would be nice to have.

Taewoo Kim created ASTERIXDB-1748:
-------------------------------------

             Summary: More enhanced Data Skew Handling during a join would be nice to have.
                 Key: ASTERIXDB-1748
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1748
             Project: Apache AsterixDB
          Issue Type: Improvement
            Reporter: Taewoo Kim


Right now, when the join key is skewed (especially when many records share the same value), a hash join is expensive. For example, the following dataset has 9,000,000 records, and we want to do a self equality join on a non-primary, non-indexed key. This key has 1.3M unique values, and the most frequent value appears in 47,000 different records. So a self-join has to perform 47,000 * 47,000 (2,209,000,000) comparisons for this one key value, and the cost is huge, as we can expect. It would be nice to process the join for such a value in parallel rather than on a single node.
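
The arithmetic above, and the benefit of spreading one heavy key over several workers, can be sketched in Python. This is only an illustration and not how AsterixDB implements its hash join: the function names and the worker count of 8 are assumptions made for the example. The idea shown is to cut one side of the heavy key into chunks and let each worker compare its chunk against all records of the other side.

```python
from collections import Counter


def comparisons_per_key(keys):
    """Comparisons a self equality join performs per key value:
    a value shared by n records costs n * n comparisons."""
    return {k: n * n for k, n in Counter(keys).items()}


def split_heavy_key(n_records, n_workers):
    """Hypothetical split: cut one side of a heavy key into n_workers
    chunks; each worker compares its chunk with all n_records records
    of the other side. Returns the comparison count per worker."""
    base, extra = divmod(n_records, n_workers)
    chunk_sizes = [base + (1 if w < extra else 0) for w in range(n_workers)]
    return [size * n_records for size in chunk_sizes]


heavy = 47_000                      # records sharing the most frequent key
total = heavy * heavy               # work done on one node today
per_worker = split_heavy_key(heavy, 8)  # 8 workers is an assumed value

print(total)            # 2209000000
print(max(per_worker))  # 276125000
```

The total work is unchanged (the per-worker counts still sum to 2,209,000,000), but no single node has to carry all of it.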

DDL:
{code}
create type AmazonReviewType as open {
	id: uuid
}

create dataset AmazonReview9Mline(AmazonReviewType) primary key id auto generated;

omit load ...
{code}

Query:
{code}
count(
for $o in dataset AmazonReview9Mline
for $i in dataset AmazonReview9Mline
where $o.reviewerID = $i.reviewerID and $o.id < $i.id
return {"oid":$o.reviewerID, "iid":$i.reviewerID}
);
{code}
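
To show how much of the result a single heavy key produces, here is a small Python sketch of what the query counts. It assumes that id values are unique (they are auto-generated primary keys), so the predicate $o.id < $i.id keeps each unordered pair of records exactly once. The function names and the toy data are made up for the example.

```python
from collections import Counter


def count_pairs_naive(records):
    """Direct translation of the query: nested loops over (id, reviewerID)
    tuples, counting pairs with equal reviewerID and o.id < i.id."""
    return sum(
        1
        for o_id, o_key in records
        for i_id, i_key in records
        if o_key == i_key and o_id < i_id
    )


def count_pairs_by_group(records):
    """Same count from group sizes: a key shared by n records
    yields n * (n - 1) / 2 pairs when ids are unique."""
    sizes = Counter(key for _, key in records)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Toy data: (id, reviewerID)
toy = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "C"), (6, "C")]
print(count_pairs_naive(toy))     # 4
print(count_pairs_by_group(toy))  # 4

# Pairs produced by the single key shared by 47,000 records
heavy_pairs = 47_000 * (47_000 - 1) // 2
print(heavy_pairs)                # 1104476500
```

So the one heavy key alone accounts for more than a billion result pairs, which is why evaluating it on a single node dominates the running time.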

Sample record:
{code}
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)