Posted to dev@lucene.apache.org by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org> on 2011/12/12 14:54:30 UTC
[jira] [Commented] (LUCENE-3602) Add join query to Lucene

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167493#comment-13167493 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Patch looks good!

  * I like the test...

  * Maybe rename actualQuery to fromQuery?  (So it's clear that
    JoinQuery runs fromQuery using fromSearcher, joining on
    fromSearcher.fromField to toSearcher.toField).

  * Why preComputedFromDocs...?  Like if you were to cache something,
    wouldn't you want to cache the toSearcher's bitset instead?

  * Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult, to
    contrast it w/ the per-segment scoring?  And add a comment
    explaining that we compute it once (on first segment) and all
    later segments then reuse it?

  * I wonder if we could make this a Filter instead, somehow?  Ie, at
    its core it converts a top-level bitset in the fromSearcher doc
    space into the joined bitset in the toSearcher doc space.  It
    could even maybe just be a static method taking in fromBitset and
    returning toBitset, which could operate per-segment on the
    toSearcher side?  (Separately: I wonder if JoinQuery should do
    something with the scores of the fromQuery....?  Not right now but
    maybe later...).

  * Why does the JoinQuery javadoc say "The downside of this
    is that in a sharded environment not all documents might get
    joined / linked." as a downside to the top-level approach?  Maybe
    reword that to state that all joined to/from docs must reside in
    the same shard?  In theory we could (later) make a shard friendly
    approach?  Eg, first pass builds up all unique Terms in the
    fromSearcher.fromField for docs matching the query (across all
    shards) and 2nd pass is basically a TermFilter on those...

  * Not sure it matters, but... including the preComputedFromDocs in
    hashCode/equals is quite costly (it goes bit by bit...).  Maybe it
    shouldn't be included, since it contains details about the
    particular searcher that query had been run against?  (In theory
    Query instances are searcher independent.)
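
To make the "static method taking in fromBitset and returning toBitset"
idea above concrete, here is a rough sketch in plain Java.  It is only
a model, not Lucene API: BitsetJoinSketch, join, fromTerms and toTerms
are hypothetical names, and the per-doc term sets stand in for whatever
the real index would provide for fromField / toField.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the join; none of these names exist in Lucene.
final class BitsetJoinSketch {

    /**
     * fromTerms.get(doc) holds the join-field values of a "from" doc,
     * toTerms.get(doc) those of a "to" doc (assumed in-memory stand-ins).
     */
    static BitSet join(BitSet fromBitset,
                       List<Set<String>> fromTerms,
                       List<Set<String>> toTerms) {
        // Pass 1: collect the unique join terms of all matching "from" docs.
        Set<String> joinTerms = new HashSet<>();
        for (int doc = fromBitset.nextSetBit(0); doc >= 0;
             doc = fromBitset.nextSetBit(doc + 1)) {
            joinTerms.addAll(fromTerms.get(doc));
        }

        // Pass 2: mark every "to" doc that carries at least one of those terms.
        BitSet toBitset = new BitSet(toTerms.size());
        for (int doc = 0; doc < toTerms.size(); doc++) {
            for (String term : toTerms.get(doc)) {
                if (joinTerms.contains(term)) {
                    toBitset.set(doc);
                    break;
                }
            }
        }
        return toBitset;
    }
}
```

The two loops also roughly mirror the shard-friendly idea: pass 1
gathers the unique terms (which could happen across all shards), and
pass 2 acts like a term filter on the to side.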

In general I think this approach is somewhat inefficient, because it
always iterates over every possible term in fromSearcher.fromField,
checking the docs for each to see if there is a match in the query.
Ie, it's like FieldCache, in that it un-inverts, but it's uninverting
on every query.

I wonder if we could use DocTermOrds instead?  (Or,
FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know fromSearcher.fromField is
single-valued).  This way we uninvert once (on init), and then doing
the join should be much faster since for each fromDocID we can lookup
the term(s) to join on.
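
Roughly what I mean by uninverting once, as a plain-Java sketch.  This
does not use DocTermOrds or FieldCache; UninvertOnceSketch, the
postings map and matchingOrds are made-up names that only model the
idea (term -> docs is inverted once into doc -> ords, then each query
just looks up ords per fromDocID).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.SortedMap;

// Hypothetical model; the real structures would come from the index.
final class UninvertOnceSketch {
    final String[] ordToTerm;   // ord -> term, in sorted term order
    final int[][] docToOrds;    // doc -> ords of its join-field terms

    // Done once (on init): walk every term's postings and record, per doc,
    // which term ords it contains.
    UninvertOnceSketch(SortedMap<String, int[]> postings, int maxDoc) {
        ordToTerm = postings.keySet().toArray(new String[0]);
        List<List<Integer>> perDoc = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            perDoc.add(new ArrayList<>());
        }
        int ord = 0;
        for (int[] docs : postings.values()) {  // same order as keySet()
            for (int doc : docs) {
                perDoc.get(doc).add(ord);
            }
            ord++;
        }
        docToOrds = new int[maxDoc][];
        for (int doc = 0; doc < maxDoc; doc++) {
            docToOrds[doc] =
                perDoc.get(doc).stream().mapToInt(Integer::intValue).toArray();
        }
    }

    // Done per query: only the matching from docs are visited, instead of
    // iterating over every term in the field.
    BitSet matchingOrds(BitSet fromDocs) {
        BitSet ords = new BitSet(ordToTerm.length);
        for (int doc = fromDocs.nextSetBit(0); doc >= 0;
             doc = fromDocs.nextSetBit(doc + 1)) {
            for (int o : docToOrds[doc]) {
                ords.set(o);
            }
        }
        return ords;
    }
}
```

The per-query cost then depends on the number of matching from docs
rather than on the number of unique terms, at the price of holding
docToOrds in memory.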

Likewise on the toSearcher side, if we had doc <-> ord/term loaded we
could do the forward (term -> ord) lookup quickly (in memory binary
search).
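
A sketch of that to-side lookup, again in plain Java rather than
Lucene API.  TermToOrdSketch, lookupOrd and joinToDocs are hypothetical
names; the sorted String[] stands in for the loaded term dictionary and
toDocToOrds for the doc -> ord mapping.  Note it sorts by Java String
order, which is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical model of the to-side term -> ord -> doc resolution.
final class TermToOrdSketch {

    // In-memory binary search over the sorted terms; -1 if the term is absent.
    static int lookupOrd(String[] sortedTerms, String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    // Resolve each join term to a to-side ord, then mark the to docs that
    // contain any of those ords.
    static BitSet joinToDocs(String[] joinTerms,
                             String[] toSortedTerms,
                             int[][] toDocToOrds) {
        BitSet toOrds = new BitSet(toSortedTerms.length);
        for (String term : joinTerms) {
            int ord = lookupOrd(toSortedTerms, term);
            if (ord >= 0) {
                toOrds.set(ord);
            }
        }
        BitSet toDocs = new BitSet(toDocToOrds.length);
        for (int doc = 0; doc < toDocToOrds.length; doc++) {
            for (int ord : toDocToOrds[doc]) {
                if (toOrds.get(ord)) {
                    toDocs.set(doc);
                    break;
                }
            }
        }
        return toDocs;
    }
}
```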

But then this will obviously use RAM... so we should have the choice
(and start w/ the current patch!).

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has had a (pseudo) join query for a while now. I think this should also be available in Lucene.
