Posted to notifications@asterixdb.apache.org by "Taewoo Kim (JIRA)" <ji...@apache.org> on 2016/07/28 21:01:20 UTC
[jira] [Commented] (ASTERIXDB-1556) Prefix-based multi-way Fuzzy-join generates an exception.
[ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398198#comment-15398198 ]
Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------
I have gone through the logic (three stages) at the AQL level. I think it is already a heavy process even for two datasets, since we need to scan the same dataset multiple times. For three or more datasets, it would be a really expensive operation, because each input (left or output) needs to be scanned multiple times in each stage. So my naive feeling is that a two-way fuzzy join is doable, but a fuzzy join over more than two datasets using the current plan optimization (FuzzyJoinRule) will be really hard.
I would like to check one thing. If there are three datasets (A, B, C) and we would like to do a three-way fuzzy join, can we do a two-dataset fuzzy join (A and B) first, then materialize this result, and finally do a two-dataset fuzzy join between the materialized output of (A and B) and C? Or does this already happen now? Let's set up a time to finalize our thoughts.
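To make the question concrete, here is a minimal Python sketch of that decomposition: a three-way fuzzy join built from two two-way fuzzy joins, with the intermediate (A and B) result materialized as a list. This is only an illustration, not how AsterixDB or FuzzyJoinRule works. The helper names (jaccard, fuzzy_join, three_way), the whitespace tokenizer, and the sample records are all assumptions made for this example; the predicates mirror the title/authors/id conditions of the query in this comment.

```python
def jaccard(a, b):
    """Jaccard similarity over lower-cased, whitespace-split word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def fuzzy_join(left, right, left_key, right_key, threshold=0.5):
    """Naive two-way fuzzy join: every (l, r) whose keys are similar enough."""
    return [(l, r) for l in left for r in right
            if jaccard(left_key(l), right_key(r)) >= threshold]


def three_way(a, b, c, threshold=0.5):
    # Step 1: two-way fuzzy join of A and B on title, materialized as a list.
    ab = fuzzy_join(a, b, lambda r: r["title"], lambda r: r["title"], threshold)
    ab = [(x, y) for x, y in ab if x["id"] < y["id"]]
    # Step 2: two-way fuzzy join of the materialized (A, B) pairs with C,
    # comparing the authors of the A side against the authors of C.
    abc = fuzzy_join(ab, c, lambda p: p[0]["authors"],
                     lambda r: r["authors"], threshold)
    rows = [(x, y, z) for (x, y), z in abc if x["id"] < z["id"]]
    return sorted(rows, key=lambda t: (t[0]["id"], t[1]["id"], t[2]["id"]))


# Hypothetical sample data, standing in for the DBLP dataset.
records = [
    {"id": 1, "title": "fuzzy join in asterixdb", "authors": "kim li"},
    {"id": 2, "title": "fuzzy join in hyracks", "authors": "kim li carey"},
    {"id": 3, "title": "transaction logging", "authors": "kim li"},
]

result_ids = [(x["id"], y["id"], z["id"])
              for x, y, z in three_way(records, records, records)]
print(result_ids)
```

The point of the sketch is that the second join only sees the already-filtered (A and B) pairs, so each step stays a two-input problem. Whether the optimizer can be made to produce a plan of that shape is exactly the open question.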
For this simple AQL query,
for $dblp in dataset('DBLP')
for $dblp2 in dataset('DBLP')
for $dblp3 in dataset('DBLP')
where $dblp.title ~= $dblp2.title and $dblp.authors ~= $dblp3.authors and $dblp.id < $dblp2.id and $dblp.id < $dblp3.id
order by $dblp.id, $dblp2.id, $dblp3.id
return {'dblp': $dblp, 'dblp2': $dblp2, 'dblp3': $dblp3}
the following logical plan (not even optimized) looks too heavy for an NC to handle (about 200 logical operators).
join (function-call: algebricks:eq, Args:[%0->$$24, %0->$$185]) -- |UNPARTITIONED|
select (function-call: algebricks:and, Args:[function-call: algebricks:lt, Args:[%0->$$24, %0->$$25], TRUE]) -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$24, %0->$$72]) -- |UNPARTITIONED|
assign [$$31] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}]] -- |UNPARTITIONED|
assign [$$27] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$24, $$0] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$25, %0->$$73]) -- |UNPARTITIONED|
assign [$$32] <- [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$25, $$1] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
group by ([$$72 := %0->$$39; $$73 := %0->$$42]) decor ([]) {
aggregate [$$100] <- [function-call: asterix:listify, Args:[%0->$$74]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
select (function-call: algebricks:ge, Args:[%0->$$74, AFloat: {0.5}]) -- |UNPARTITIONED|
assign [$$74] <- [function-call: asterix:similarity-jaccard-prefix, Args:[%0->$$50, %0->$$58, %0->$$61, %0->$$69, %0->$$59, AFloat: {0.5}]] -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$59, %0->$$70]) -- |UNPARTITIONED|
unnest $$59 <- function-call: asterix:subset-collection, Args:[%0->$$58, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$58], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$58] <- [%0->$$82] -- |UNPARTITIONED|
subplan {
aggregate [$$82] <- [function-call: asterix:listify, Args:[%0->$$53]] -- |UNPARTITIONED|
order (ASC, %0->$$53) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$51, %0->$$52]) -- |UNPARTITIONED|
unnest $$52 at $$53 <- function-call: asterix:scan-collection, Args:[%0->$$80] -- |UNPARTITIONED|
subplan {
aggregate [$$80] <- [function-call: asterix:listify, Args:[%0->$$56]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$78]) (ASC, %0->$$56) -- |UNPARTITIONED|
group by ([$$56 := %0->$$55]) decor ([]) {
aggregate [$$78] <- [function-call: asterix:listify, Args:[%0->$$57]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$55 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$43]] -- |UNPARTITIONED|
assign [$$57] <- [%0->$$45] -- |UNPARTITIONED|
assign [$$43] <- [function-call: asterix:field-access-by-index, Args:[%0->$$44, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$45, $$44] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$51 <- function-call: asterix:scan-collection, Args:[%0->$$49] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$50] <- [function-call: asterix:len, Args:[%0->$$49]] -- |UNPARTITIONED|
assign [$$49] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$36]] -- |UNPARTITIONED|
assign [$$36] <- [function-call: asterix:field-access-by-index, Args:[%0->$$37, AInt32: {2}]] -- |UNPARTITIONED|
assign [$$38] <- [function-call: asterix:field-access-by-index, Args:[%0->$$37, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$39, $$37] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
unnest $$70 <- function-call: asterix:subset-collection, Args:[%0->$$69, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$69], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$69] <- [%0->$$93] -- |UNPARTITIONED|
subplan {
aggregate [$$93] <- [function-call: asterix:listify, Args:[%0->$$64]] -- |UNPARTITIONED|
order (ASC, %0->$$64) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$62, %0->$$63]) -- |UNPARTITIONED|
unnest $$63 at $$64 <- function-call: asterix:scan-collection, Args:[%0->$$91] -- |UNPARTITIONED|
subplan {
aggregate [$$91] <- [function-call: asterix:listify, Args:[%0->$$67]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$89]) (ASC, %0->$$67) -- |UNPARTITIONED|
group by ([$$67 := %0->$$66]) decor ([]) {
aggregate [$$89] <- [function-call: asterix:listify, Args:[%0->$$68]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$66 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$46]] -- |UNPARTITIONED|
assign [$$68] <- [%0->$$48] -- |UNPARTITIONED|
assign [$$46] <- [function-call: asterix:field-access-by-index, Args:[%0->$$47, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$48, $$47] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$62 <- function-call: asterix:scan-collection, Args:[%0->$$60] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$61] <- [function-call: asterix:len, Args:[%0->$$60]] -- |UNPARTITIONED|
assign [$$60] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$40]] -- |UNPARTITIONED|
assign [$$40] <- [function-call: asterix:field-access-by-index, Args:[%0->$$41, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$42, $$41] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$26, %0->$$186]) -- |UNPARTITIONED|
assign [$$28] <- [function-call: asterix:field-access-by-index, Args:[%0->$$2, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$26, $$2] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
group by ([$$185 := %0->$$103; $$186 := %0->$$155]) decor ([]) {
aggregate [$$213] <- [function-call: asterix:listify, Args:[%0->$$187]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
select (function-call: algebricks:ge, Args:[%0->$$187, AFloat: {0.5}]) -- |UNPARTITIONED|
assign [$$187] <- [function-call: asterix:similarity-jaccard-prefix, Args:[%0->$$163, %0->$$171, %0->$$174, %0->$$182, %0->$$172, AFloat: {0.5}]] -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$172, %0->$$183]) -- |UNPARTITIONED|
unnest $$172 <- function-call: asterix:subset-collection, Args:[%0->$$171, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$171], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$171] <- [%0->$$195] -- |UNPARTITIONED|
subplan {
aggregate [$$195] <- [function-call: asterix:listify, Args:[%0->$$166]] -- |UNPARTITIONED|
order (ASC, %0->$$166) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$164, %0->$$165]) -- |UNPARTITIONED|
unnest $$165 at $$166 <- function-call: asterix:scan-collection, Args:[%0->$$193] -- |UNPARTITIONED|
subplan {
aggregate [$$193] <- [function-call: asterix:listify, Args:[%0->$$169]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$191]) (ASC, %0->$$169) -- |UNPARTITIONED|
group by ([$$169 := %0->$$168]) decor ([]) {
aggregate [$$191] <- [function-call: asterix:listify, Args:[%0->$$170]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$168 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$156]] -- |UNPARTITIONED|
assign [$$170] <- [%0->$$158] -- |UNPARTITIONED|
assign [$$156] <- [function-call: asterix:field-access-by-index, Args:[%0->$$157, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$158, $$157] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$164 <- function-call: asterix:scan-collection, Args:[%0->$$162] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$163] <- [function-call: asterix:len, Args:[%0->$$162]] -- |UNPARTITIONED|
assign [$$162] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$108]] -- |UNPARTITIONED|
select (function-call: algebricks:and, Args:[function-call: algebricks:lt, Args:[%0->$$103, %0->$$104], TRUE]) -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$103, %0->$$105]) -- |UNPARTITIONED|
assign [$$106] <- [function-call: asterix:field-access-by-index, Args:[%0->$$107, AInt32: {2}]] -- |UNPARTITIONED|
assign [$$108] <- [function-call: asterix:field-access-by-index, Args:[%0->$$107, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$103, $$107] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$104, %0->$$109]) -- |UNPARTITIONED|
assign [$$110] <- [function-call: asterix:field-access-by-index, Args:[%0->$$111, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$104, $$111] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
group by ([$$105 := %0->$$112; $$109 := %0->$$113]) decor ([]) {
aggregate [$$152] <- [function-call: asterix:listify, Args:[%0->$$114]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
select (function-call: algebricks:ge, Args:[%0->$$114, AFloat: {0.5}]) -- |UNPARTITIONED|
assign [$$114] <- [function-call: asterix:similarity-jaccard-prefix, Args:[%0->$$115, %0->$$116, %0->$$117, %0->$$118, %0->$$119, AFloat: {0.5}]] -- |UNPARTITIONED|
join (function-call: algebricks:eq, Args:[%0->$$119, %0->$$120]) -- |UNPARTITIONED|
unnest $$119 <- function-call: asterix:subset-collection, Args:[%0->$$116, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$116], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$116] <- [%0->$$121] -- |UNPARTITIONED|
subplan {
aggregate [$$121] <- [function-call: asterix:listify, Args:[%0->$$126]] -- |UNPARTITIONED|
order (ASC, %0->$$126) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$127, %0->$$128]) -- |UNPARTITIONED|
unnest $$128 at $$126 <- function-call: asterix:scan-collection, Args:[%0->$$129] -- |UNPARTITIONED|
subplan {
aggregate [$$129] <- [function-call: asterix:listify, Args:[%0->$$130]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$131]) (ASC, %0->$$130) -- |UNPARTITIONED|
group by ([$$130 := %0->$$132]) decor ([]) {
aggregate [$$131] <- [function-call: asterix:listify, Args:[%0->$$134]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$132 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$133]] -- |UNPARTITIONED|
assign [$$134] <- [%0->$$135] -- |UNPARTITIONED|
assign [$$133] <- [function-call: asterix:field-access-by-index, Args:[%0->$$136, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$135, $$136] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$127 <- function-call: asterix:scan-collection, Args:[%0->$$122] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$115] <- [function-call: asterix:len, Args:[%0->$$122]] -- |UNPARTITIONED|
assign [$$122] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$123]] -- |UNPARTITIONED|
assign [$$123] <- [function-call: asterix:field-access-by-index, Args:[%0->$$124, AInt32: {2}]] -- |UNPARTITIONED|
assign [$$125] <- [function-call: asterix:field-access-by-index, Args:[%0->$$124, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$112, $$124] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
unnest $$120 <- function-call: asterix:subset-collection, Args:[%0->$$118, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$118], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$118] <- [%0->$$137] -- |UNPARTITIONED|
subplan {
aggregate [$$137] <- [function-call: asterix:listify, Args:[%0->$$141]] -- |UNPARTITIONED|
order (ASC, %0->$$141) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$142, %0->$$143]) -- |UNPARTITIONED|
unnest $$143 at $$141 <- function-call: asterix:scan-collection, Args:[%0->$$144] -- |UNPARTITIONED|
subplan {
aggregate [$$144] <- [function-call: asterix:listify, Args:[%0->$$145]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$146]) (ASC, %0->$$145) -- |UNPARTITIONED|
group by ([$$145 := %0->$$147]) decor ([]) {
aggregate [$$146] <- [function-call: asterix:listify, Args:[%0->$$149]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$147 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$148]] -- |UNPARTITIONED|
assign [$$149] <- [%0->$$150] -- |UNPARTITIONED|
assign [$$148] <- [function-call: asterix:field-access-by-index, Args:[%0->$$151, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$150, $$151] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$142 <- function-call: asterix:scan-collection, Args:[%0->$$138] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$117] <- [function-call: asterix:len, Args:[%0->$$138]] -- |UNPARTITIONED|
assign [$$138] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$139]] -- |UNPARTITIONED|
assign [$$139] <- [function-call: asterix:field-access-by-index, Args:[%0->$$140, AInt32: {2}]] -- |UNPARTITIONED|
data-scan []<-[$$113, $$140] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
unnest $$183 <- function-call: asterix:subset-collection, Args:[%0->$$182, AInt32: {0}, function-call: asterix:prefix-len-jaccard, Args:[function-call: asterix:len, Args:[%0->$$182], AFloat: {0.5}]] -- |UNPARTITIONED|
assign [$$182] <- [%0->$$206] -- |UNPARTITIONED|
subplan {
aggregate [$$206] <- [function-call: asterix:listify, Args:[%0->$$177]] -- |UNPARTITIONED|
order (ASC, %0->$$177) -- |UNPARTITIONED|
select (function-call: algebricks:eq, Args:[%0->$$175, %0->$$176]) -- |UNPARTITIONED|
unnest $$176 at $$177 <- function-call: asterix:scan-collection, Args:[%0->$$204] -- |UNPARTITIONED|
subplan {
aggregate [$$204] <- [function-call: asterix:listify, Args:[%0->$$180]] -- |UNPARTITIONED|
order (ASC, function-call: asterix:count, Args:[%0->$$202]) (ASC, %0->$$180) -- |UNPARTITIONED|
group by ([$$180 := %0->$$179]) decor ([]) {
aggregate [$$202] <- [function-call: asterix:listify, Args:[%0->$$181]] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$179 <- function-call: asterix:scan-collection, Args:[function-call: asterix:counthashed-word-tokens, Args:[%0->$$159]] -- |UNPARTITIONED|
assign [$$181] <- [%0->$$161] -- |UNPARTITIONED|
assign [$$159] <- [function-call: asterix:field-access-by-index, Args:[%0->$$160, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$161, $$160] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
unnest $$175 <- function-call: asterix:scan-collection, Args:[%0->$$173] -- |UNPARTITIONED|
nested tuple source -- |UNPARTITIONED|
} -- |UNPARTITIONED|
assign [$$174] <- [function-call: asterix:len, Args:[%0->$$173]] -- |UNPARTITIONED|
assign [$$173] <- [function-call: asterix:counthashed-word-tokens, Args:[%0->$$153]] -- |UNPARTITIONED|
assign [$$153] <- [function-call: asterix:field-access-by-index, Args:[%0->$$154, AInt32: {3}]] -- |UNPARTITIONED|
data-scan []<-[$$155, $$154] <- fuzzyjoin:DBLP -- |UNPARTITIONED|
empty-tuple-source -- |UNPARTITIONED|
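The pattern repeated in the plan above appears to be: rank tokens by frequency (group by token, order by count), take a prefix of each record's ranked token list (subset-collection with prefix-len-jaccard), join records on a shared prefix token, verify the similarity against 0.5, group the resulting id pairs, and join the pairs back to the dataset. Below is a minimal Python sketch of those three stages for a single self-join. It is a simplification under stated assumptions: it uses plain word sets instead of counthashed-word-tokens, the prefix length is the standard prefix-filtering formula for Jaccard (not confirmed to match AsterixDB's prefix-len-jaccard exactly), and all function names and sample records are hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    return set(text.lower().split())


def token_order(records, field):
    """Stage 1: rank tokens by ascending frequency, ties broken by token."""
    counts = Counter(t for r in records for t in tokens(r[field]))
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {t: rank for rank, t in enumerate(ordered)}


def prefix_len(n, threshold):
    """Standard Jaccard prefix length for a record with n tokens."""
    return n - math.ceil(threshold * n) + 1


def rid_pairs(records, field, threshold=0.5):
    """Stage 2: candidate pairs via shared prefix tokens, then verification."""
    rank = token_order(records, field)
    token_sets, index = {}, {}
    for r in records:
        ranked = sorted(tokens(r[field]), key=rank.__getitem__)
        token_sets[r["id"]] = set(ranked)
        for t in ranked[:prefix_len(len(ranked), threshold)]:
            index.setdefault(t, []).append(r["id"])
    pairs = set()  # the set plays the role of the group-by on id pairs
    for ids in index.values():
        for a in ids:
            for b in ids:
                if a < b:
                    sa, sb = token_sets[a], token_sets[b]
                    if len(sa & sb) / len(sa | sb) >= threshold:
                        pairs.add((a, b))
    return sorted(pairs)


def fuzzy_self_join(records, field, threshold=0.5):
    """Stage 3: join the verified id pairs back to the full records."""
    by_id = {r["id"]: r for r in records}
    return [(by_id[a], by_id[b])
            for a, b in rid_pairs(records, field, threshold)]


# Hypothetical sample data, standing in for the DBLP dataset.
records = [
    {"id": 1, "title": "fuzzy join in asterixdb", "authors": "kim li"},
    {"id": 2, "title": "fuzzy join in hyracks", "authors": "kim li carey"},
    {"id": 3, "title": "transaction logging", "authors": "kim li"},
]

print(rid_pairs(records, "title"))
```

Even in this toy form, one fuzzy-join predicate touches the dataset several times (once for token ordering, once per side for prefixes, and again to fetch the full records). That matches the many data-scan operators in the plan and the concern about scanning each input multiple times per stage.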
> Prefix-based multi-way Fuzzy-join generates an exception.
> ---------------------------------------------------------
>
> Key: ASTERIXDB-1556
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Taewoo Kim
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (more than two inputs), the system generates an out-of-memory exception.
> Since a fuzzy-join is created using 30-40 lines of AQL code, and this AQL is translated into a massive number of operators (more than 200 operators in the plan for a 3-way fuzzy join), it could generate an out-of-memory exception.