Posted to dev@lucene.apache.org by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org> on 2011/12/12 14:54:30 UTC
[jira] [Commented] (LUCENE-3602) Add join query to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167493#comment-13167493 ]
Michael McCandless commented on LUCENE-3602:
--------------------------------------------
Patch looks good!
* I like the test...
* Maybe rename actualQuery to fromQuery? (So it's clear that
JoinQuery runs fromQuery using fromSearcher, joining on
fromSearcher.fromField to toSearcher.toField).
* Why preComputedFromDocs...? Like if you were to cache something,
wouldn't you want to cache the toSearcher's bitset instead?
* Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult, to
contrast it w/ the per-segment scoring? And add a comment
explaining that we compute it once (on first segment) and all
later segments then reuse it?
* I wonder if we could make this a Filter instead, somehow? Ie, at
its core it converts a top-level bitset in the fromSearcher doc
space into the joined bitset in the toSearcher doc space. It
could even maybe just be a static method taking in fromBitset and
returning toBitset, which could operate per-segment on the
toSearcher side? (Separately: I wonder if JoinQuery should do
something with the scores of the fromQuery....? Not right now but
maybe later...).
* Why does the JoinQuery javadoc say "The downside of this
is that in a sharded environment not all documents might get
joined / linked." as a downside to the top-level approach? Maybe
reword that to state that all joined to/from docs must reside in
the same shard? In theory we could (later) make a shard friendly
approach? Eg, first pass builds up all unique Terms in the
fromSearcher.fromField for docs matching the query (across all
shards) and 2nd pass is basically a TermFilter on those...
* Not sure it matters, but... including the preComputedFromDocs in
hashCode/equals is quite costly (it goes bit by bit...). Maybe it
shouldn't be included, since it contains details about the
particular searcher that query had been run against? (In theory
Query instances are searcher independent.)
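To make the "static method taking in fromBitset and returning toBitset" idea and the two-pass, shard-friendly variant from the list above concrete, here is a minimal sketch. It deliberately uses no Lucene APIs: plain String[][] arrays stand in for the per-document values of fromField and toField, and the class and method names (BitSetJoinSketch, collectFromTerms, filterToDocs, join) are hypothetical, not taken from the patch.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; arrays stand in for the index.
final class BitSetJoinSketch {

    // Pass 1: collect the unique join terms of every matching "from" document.
    static Set<String> collectFromTerms(BitSet fromDocs, String[][] fromFieldValues) {
        Set<String> terms = new HashSet<>();
        for (int doc = fromDocs.nextSetBit(0); doc >= 0; doc = fromDocs.nextSetBit(doc + 1)) {
            for (String term : fromFieldValues[doc]) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Pass 2: mark every "to" document whose join field holds one of those terms.
    // This part only needs the term set, so it could run per segment or per shard.
    static BitSet filterToDocs(Set<String> terms, String[][] toFieldValues) {
        BitSet toDocs = new BitSet(toFieldValues.length);
        for (int doc = 0; doc < toFieldValues.length; doc++) {
            for (String term : toFieldValues[doc]) {
                if (terms.contains(term)) {
                    toDocs.set(doc);
                    break;
                }
            }
        }
        return toDocs;
    }

    // The whole join: from-side bitset in, to-side bitset out.
    static BitSet join(BitSet fromDocs, String[][] fromFieldValues, String[][] toFieldValues) {
        return filterToDocs(collectFromTerms(fromDocs, fromFieldValues), toFieldValues);
    }

    public static void main(String[] args) {
        String[][] fromValues = {{"a"}, {"b"}, {"c", "d"}};
        String[][] toValues = {{"d"}, {"b"}, {"a"}, {"x"}};
        BitSet fromDocs = new BitSet();
        fromDocs.set(0);
        fromDocs.set(2);
        System.out.println(join(fromDocs, fromValues, toValues));
    }
}
```

Splitting the join into two passes means only the set of terms has to travel between the from side and the to side, which is what would let the second pass behave like a term filter on each shard.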
In general I think this approach is somewhat inefficient, because it
always iterates over every possible term in fromSearcher.fromField,
checking the docs for each to see if there is a match in the query.
Ie, it's like FieldCache, in that it un-inverts, but it's uninverting
on every query.
I wonder if we could use DocTermOrds instead? (Or,
FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know fromSearcher.fromField is
single-valued). This way we uninvert once (on init), and then doing
the join should be much faster since for each fromDocID we can look up
the term(s) to join on.
Likewise on the toSearcher side, if we had doc <-> ord/term loaded we
could do the forward (term -> ord) lookup quickly (in memory binary
search).
But then this will obviously use RAM... so we should have the choice
(and start w/ the current patch!).
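The uninvert-once idea above can be sketched roughly as follows, assuming a single-valued join field. This is not DocTermOrds or FieldCache code: OrdJoinSketch and its members are hypothetical names, and a String[] of per-document values stands in for the index. Each side builds a sorted term dictionary plus a doc -> ord array once; the join then looks up the term for each fromDocID and resolves it on the to side with an in-memory binary search.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.TreeSet;

// Hypothetical illustration only; assumes at most one join term per document.
final class OrdJoinSketch {
    final String[] sortedTerms; // ord -> term, in sorted order
    final int[] docToOrd;       // docID -> ord, or -1 when the doc has no value

    // "Uninvert" once: build the term dictionary and the doc -> ord mapping.
    OrdJoinSketch(String[] fieldValues) {
        TreeSet<String> unique = new TreeSet<>();
        for (String value : fieldValues) {
            if (value != null) {
                unique.add(value);
            }
        }
        sortedTerms = unique.toArray(new String[0]);
        docToOrd = new int[fieldValues.length];
        for (int doc = 0; doc < fieldValues.length; doc++) {
            String value = fieldValues[doc];
            docToOrd[doc] = value == null ? -1 : Arrays.binarySearch(sortedTerms, value);
        }
    }

    // Term for a document, or null when the document has no value.
    String termForDoc(int doc) {
        int ord = docToOrd[doc];
        return ord < 0 ? null : sortedTerms[ord];
    }

    // Forward lookup (term -> ord) by binary search; -1 when the term is absent.
    int ordForTerm(String term) {
        int ord = Arrays.binarySearch(sortedTerms, term);
        return ord >= 0 ? ord : -1;
    }

    // Join: only the matching from docs are visited, not every term in the field.
    static BitSet join(BitSet fromDocs, OrdJoinSketch from, OrdJoinSketch to) {
        BitSet matchingToOrds = new BitSet(to.sortedTerms.length);
        for (int doc = fromDocs.nextSetBit(0); doc >= 0; doc = fromDocs.nextSetBit(doc + 1)) {
            String term = from.termForDoc(doc);
            if (term == null) {
                continue;
            }
            int ord = to.ordForTerm(term);
            if (ord >= 0) {
                matchingToOrds.set(ord);
            }
        }
        BitSet toDocs = new BitSet(to.docToOrd.length);
        for (int doc = 0; doc < to.docToOrd.length; doc++) {
            int ord = to.docToOrd[doc];
            if (ord >= 0 && matchingToOrds.get(ord)) {
                toDocs.set(doc);
            }
        }
        return toDocs;
    }

    public static void main(String[] args) {
        OrdJoinSketch from = new OrdJoinSketch(new String[] {"a", "b", null, "c"});
        OrdJoinSketch to = new OrdJoinSketch(new String[] {"c", "a", "z", null, "a"});
        BitSet fromDocs = new BitSet();
        fromDocs.set(0);
        fromDocs.set(2);
        fromDocs.set(3);
        System.out.println(join(fromDocs, from, to));
    }
}
```

The two arrays per side are where the RAM goes, which is the trade-off noted above: the per-query cost drops to the number of matching from docs plus one pass over the to side's doc -> ord array.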
> Add join query to Lucene
> ------------------------
>
> Key: LUCENE-3602
> URL: https://issues.apache.org/jira/browse/LUCENE-3602
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/join
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has had a (pseudo) join query for a while now. I think this should also be available in Lucene.