Posted to notifications@asterixdb.apache.org by "Taewoo Kim (JIRA)" <ji...@apache.org> on 2016/10/21 23:27:58 UTC

[jira] [Created] (ASTERIXDB-1704) Fuzzy-join query is slow

Taewoo Kim created ASTERIXDB-1704:
-------------------------------------

             Summary: Fuzzy-join query is slow
                 Key: ASTERIXDB-1704
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
             Project: Apache AsterixDB
          Issue Type: Bug
            Reporter: Taewoo Kim


I have an issue with the prefix-based fuzzy join (the non-index-based fuzzy join) on a small dataset. The following query runs forever even on a dataset of 200K records spread across 9 nodes, so each node holds only about 22,000 records. The individual records are not large either.

{code}
count(
for $o in dataset AmazonReview
for $i in dataset AmazonReview
where similarity-jaccard(word-tokens($o.reviewText), word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
return {"oid":$o.reviewerID, "iid":$i.reviewerID}
);
{code}
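
For readers unfamiliar with the predicate, here is a rough Python sketch of what the join condition computes for one pair of records, plus the number of pairs a naive self-join with the $o.id < $i.id condition would have to consider. This is not AsterixDB code: the tokenizer (lowercase, split on non-alphanumeric characters) and the set-based Jaccard are assumptions made for illustration, and the function names are hypothetical.

```python
import re


def word_tokens(text):
    # Assumption: lowercase the text and split on non-alphanumeric characters.
    # The real word-tokens() function may tokenize differently.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def jaccard(a, b):
    # Set-based Jaccard similarity: |A intersect B| / |A union B|.
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # convention chosen for this sketch only
    return len(sa & sb) / len(sa | sb)


def candidate_pairs(n):
    # Number of unordered pairs (o, i) with o.id < i.id among n records.
    return n * (n - 1) // 2


if __name__ == "__main__":
    left = word_tokens("Great purchase though!")
    right = word_tokens("A great purchase")
    print(jaccard(left, right) >= 0.2)
    print(candidate_pairs(200_000))
```

Without any filtering, 200K records give roughly 2 × 10^10 pairs to verify, which is why the prefix filter matters so much for this query.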

An example record is as follows.  

{code}
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
{code}
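
One possible factor is the low threshold. In the standard prefix-filtering scheme from the set-similarity join literature, a record with n tokens and a Jaccard threshold t contributes its first n - ceil(t * n) + 1 tokens (in a global token order) as join keys. The sketch below only evaluates that formula; whether AsterixDB uses exactly this prefix length is an assumption, and prefix_length is a hypothetical helper name.

```python
import math
from fractions import Fraction


def prefix_length(num_tokens, threshold):
    # Standard prefix-filter length for Jaccard: n - ceil(t * n) + 1.
    # Fraction(str(...)) avoids floating-point error inside the ceiling.
    t = Fraction(str(threshold))
    return num_tokens - math.ceil(t * num_tokens) + 1


if __name__ == "__main__":
    for t in (0.2, 0.5, 0.8):
        print(t, prefix_length(10, t))
```

Under this formula, a threshold of 0.2 keeps most of each record's tokens in the prefix (9 of 10 tokens in the example call), so the filter prunes far fewer candidate pairs than it would at a higher threshold such as 0.8.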



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)