You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Martijn van Groningen (Created) (JIRA)" <ji...@apache.org> on 2011/11/27 21:31:40 UTC

[jira] [Created] (LUCENE-3602) Add join query to Lucene

Add join query to Lucene
------------------------

                 Key: LUCENE-3602
                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/join
            Reporter: Martijn van Groningen


Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Fix Version/s: 4.0
                   3.6
    
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159124#comment-13159124 ] 

Martijn van Groningen edited comment on LUCENE-3602 at 11/29/11 6:56 AM:
-------------------------------------------------------------------------

I'll remove the builder api. This was just my sugar api. I'll change that in a constructor where the toSearcher and preComputedFromDocs are optional.

bq. How does the preComputedFromDocs work? It's not per-segment? Like it's a bitset across entire toplevel doc space?
Yes, it is a bitset across all segments, so toplevel. The whole query implementation is top level. People can use this the execute the real query outside the JoinQuery (if this isn't specified the query is executed when the Weight is created for the JoinQuery). I think this would be nice if people want the for example cache the encapsulated query.

bq. I wonder if we can make this work per-segment... but that can be a 2nd phase.
I thought about that. But this would mean we need to execute the join query in two phases. The first phase would collect all the "from" values from the documents matching the encapsulated query. The second phase would match the documents that have matching on the "to" side with the "from" values collected in the first phase. The will work per segment and I think would also make the JoinQuery work in distributed environment. We can do this in a second development phase.

bq. I think you can use seekExact instead of seekCeil? Better performance...
I'll change that.

bq. What is the AdjustedDISI for (and when would Weight.scorer get a top-level context...)? 
Actually that is only used to map top level docids to segment docids (Basically doing: toplevel_docid - base). This is necessary because Weight.scorer works per segment and the query implementation not.
                
      was (Author: martijn.v.groningen):
    I'll remove the builder api. This was just my sugar api. I'll change that in a constructor where the toSearcher and preComputedFromDocs are optional.

bq. How does the preComputedFromDocs work? It's not per-segment? Like it's a bitset across entire toplevel doc space?
Yes, it is a bitset across all segments, so toplevel. The whole query implementation is top level. People can use this the execute the real query outside the JoinQuery (if this isn't specified the query is executed when the Weight is created for the JoinQuery). I think this would be nice if people want the for example cache the encapsulated query.

I wonder if we can make this work per-segment... but that can be a 2nd phase.
I thought about that. But this would mean we need to execute the join query in two phases. The first phase would collect all the "from" values from the documents matching the encapsulated query. The second phase would match the documents that have matching on the "to" side with the "from" values collected in the first phase. The will work per segment and I think would also make the JoinQuery work in distributed environment. We can do this in a second development phase.

I think you can use seekExact instead of seekCeil? Better performance...
I'll change that.

What is the AdjustedDISI for (and when would Weight.scorer get a top-level context...)? Actually that is only used to map top level docids to segment docids (Basically doing: toplevel_docid - base). This is necessary because Weight.scorer works per segment and the query implementation not.
                  
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202806#comment-13202806 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

Committed latest 3x patch to branch3x.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached a new patch!
* The static method now has Query as return type.
* TermsCollector and TermsQuery are package protected.
* TermsQuery now holds an int[] ords instead of an array of ByteRefs. This is a really nice improvement :)
* Swapped the const args of TermQuery. Consistency is important!
* After [some discussion|http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2012-01-16#l231] moved the classes back to o.a.l.search.join package.  
* The query time joining uses the dict comparator on all places now (instead of ByteRef#UTF8AsUnicode).

In the that cmp==0 in TermsQuery at line 121; YES_AND_SEEK can't be returned. The FilteredTermsEnum prohibits seeking to a term smaller or equal then the current term. 
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184191#comment-13184191 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Looking great!  I think we are getting close :)

Maybe use BytesRefHash (oal.util) to gather the terms and sort
them in the end?  Then hold onto the BytesRef[] (sorted).  It's
designed for exactly this usage...

Also, I think you should always seek in your FilteredTermsEnum, and
simply iterate yourself through the BytesRef[]?  Probably perf wasn't
great before because you were calling TreeSet.higher on each term?

I think you should call setRewriteMethod in your TermSetQuery ctor, to
CONSTANT_SCORE_AUTO_REWRITE_DEFAULT, since we have no scores here.
This rewrite method uses BQ if number of terms/docs is "smallish" else
creates a filter, so it should give good perf.

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167959#comment-13167959 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

Maybe we can (in another issue) move bit set filter caching into SearchManager, for use by Lucene Join (here), and others.  At the same time making bitset filtering per-segment, a fundamental improvement from the existing (old) Solr code.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187457#comment-13187457 ] 

Robert Muir commented on LUCENE-3602:
-------------------------------------

Just a few ideas about 3.x backport:
* MultiTermQuery doesn't have a real 'seek' there... you have to open up and swap in a new 'actualEnum' (does this clone the indexinput etc?) starting at the new seek position... I think its tricky and ugly. Maybe we can/should seek to 3.x TermEnum/FilteredTermEnum so this will work nicer...? I havent looked at how hairy this would be though.
* The rewrites are not always per-segment like they are in trunk, and it could be bad to "seek" a lot when MTQ is in 'auto mode' because if I recall its expensive (creating lots of multitermsenums). Maybe since Join knows up-front how many terms are in the hash it should determine which method to use itself (setRewriteMethod)... maybe on 3.x just set filter rewrite since you know will be "seeking"? Since you return the query the user could still always override the rewrite method if they really care, it would just be a default.

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602-3x.patch

Attached updated version of query time joining for 3x branch. Instead of doing a binary search for each term comparison it seeks / iterates forward. It can't do seeking like we do in trunk, so it isn't as fast as in trunk. However I do think this can be committed to at least have query time join support in 3x. Back porting per segment filtering and the MTQ that is in trunk is quite some work...
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202810#comment-13202810 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

The joining in 3x is ~3 times slower than the joining committed to trunk. This is mainly caused by the fact that the MTQ in trunk can do seeking while the the MTQ in 3x can only increment to the next term.


                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187297#comment-13187297 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

Committed to trunk. I'll leave this issue open for the backport to 3.6
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186635#comment-13186635 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Also: I wonder if we should use DocValues in addition (instead of?) FieldCache.getTerms...
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Small update. Removed unused method.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185527#comment-13185527 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

Cool. I will use the BytesRefHash in the patch and see how this pans out.

{quote}
Also, I think you should always seek in your FilteredTermsEnum, and
simply iterate yourself through the BytesRef[]? Probably perf wasn't
great before because you were calling TreeSet.higher on each term?
{quote}
I've been testing this out with treeSet.higher(). In the case when there're gaps between the terms the performance is much better with seeking then not seeking (I picked random terms in a small test). In the terms are next / near to each other it seems it is better use a non seeking implementation. I think we should have a seeking and non seeking impl available. Maybe there should should be an option named useSeekingTermsEnum which decides what impl to us?

bq. I think you should call setRewriteMethod in your TermSetQuery ctor, to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT
CONSTANT_SCORE_AUTO_REWRITE_DEFAULT is already the default in MTQ, right?
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186893#comment-13186893 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

bq. To support joining by docvalues we can make a different TermsCollector implementation, but I'm unsure how the TermsQuery can be used. 

Actually I think TermsQuery (once we change it to just take the BytesRefHash and iterate over the int[] ord instead of BytesRef[]) can be re-used.

Ie, when collecting terms from DocValues, you'd just stuff them into the BytesRefHash like you do now... so supporting terms from DocValues should simply be another private impl inside TermsCollector (just like you can now pull from DocTermOrds or FieldCache).

But I agree: let's do this (enable JoinQuery to use DocValues) in a follow-on issue!
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186864#comment-13186864 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

I think that joining by docvalues would be great addition! I do think that joining by indexed terms should be possible to.
To support joining by docvalues we can make a different TermsCollector implementation, but I'm unsure how the TermsQuery can be used. Can A MTQ also iterate over docvalues? 

I think we should commit what we currently have first and then address joining by docvalues in a new issue.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200913#comment-13200913 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

Jason: Better late then never... BRH is used to collect the matching from terms. The DTI just contains all terms / ords for a field. Comparing DTI ords isn't going to work when a term is in more than one segment or appears in a different field (fromField / toField). So I think the BRH can't be replaced by the DTI. The BRH could be cached per query.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167865#comment-13167865 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

bq. You mean the app sometimes actually uses the fromSearcher.fromQuery's results, directly, without joining?
Yes. In the case of Solr it is checking the filter cache.

bq. but what worries me is just how slow this non-RAM version will be.
I have been running the JoinQuery on my test data set (10.1 M docs) and it isn't as bad as I expect it would be. This data set contains 100000 products each product having 100 offers. The JoinQuery with a *:* query as fromQuery takes about 900 ms and a fromQuery selecting all products with a specific keyword takes about 350 ms. I think this specific query implementation is suitable for environments where RAM might be scarce. The RAM version should be the default.

bq. Stepping back a bit... do we know how this impl compares to how ElasticSearch does joins? And to how Solr does...?
ES only has index time joining, right? Solr basically uses the same mechanism as the JoinQuery in this patch, but a bit smarter. It tries to cache the from term to to term lookup (see JoinQParserPlugin.java line 367). I couldn't port this part to joining module since this caching relies heavily on the SolrIndexSearcher.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159124#comment-13159124 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

I'll remove the builder api. This was just my sugar api. I'll change that in a constructor where the toSearcher and preComputedFromDocs are optional.

bq. How does the preComputedFromDocs work? It's not per-segment? Like it's a bitset across entire toplevel doc space?
Yes, it is a bitset across all segments, so toplevel. The whole query implementation is top level. People can use this the execute the real query outside the JoinQuery (if this isn't specified the query is executed when the Weight is created for the JoinQuery). I think this would be nice if people want the for example cache the encapsulated query.

I wonder if we can make this work per-segment... but that can be a 2nd phase.
I thought about that. But this would mean we need to execute the join query in two phases. The first phase would collect all the "from" values from the documents matching the encapsulated query. The second phase would match the documents that have matching on the "to" side with the "from" values collected in the first phase. The will work per segment and I think would also make the JoinQuery work in distributed environment. We can do this in a second development phase.

I think you can use seekExact instead of seekCeil? Better performance...
I'll change that.

What is the AdjustedDISI for (and when would Weight.scorer get a top-level context...)? Actually that is only used to map top level docids to segment docids (Basically doing: toplevel_docid - base). This is necessary because Weight.scorer works per segment and the query implementation not.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached a new patch.
* Fixed some small comment typos.
* Removed the reuse field in the TermsQuery. Since it doesn't make sense to reuse b/c it assigned null! Also I cannot know if the caller is done with the TermsEnum.

It is ready. I'll commit this asap.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186629#comment-13186629 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Looking great!

  * Can't TermsCollector be package private?  Like it's only used
    privately in JoinUtil.createJoinQuery?

  * Can JoinUtil.createJoinQuery return Query not MultiTermQuery...?
    Just gives us freedom in the future to impl a different Query if
    we want/need to...

  * Typo: collecter -> collector

  * Can TermsQuery be package private?  Also, we can save a pass over
    the sorted terms by having TermsQuery hold the int[] ord array and
    then just do the lookup (against BytesRefHash) as it
    goes... really TermsQuery could just take the BytesRefHash.  Then
    we wouldn't have to materialize a new BytesRef for each matched
    term... just reuse a single scratch BytesRef inside TermsQuery.

  * In the TermsQuery.accept... should that {{return AcceptStatus.YES}}
    in the {{if (cmp == 0)}} be a YES_AND_SEEK
    (after setting the next term as the seekTerm)?

  * Hmm we sort by unicode order in TermsCollector.getCollectedTerms,
    but then by the term dict's comparator in TermsQuery; maybe just
    use the UTF8AsUnicode comparator in TermsQuery too?  And note in
    jdocs that this is required?

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159335#comment-13159335 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

Great to see this moving out of Solr and getting new eyes on it (with added improvements)!
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189509#comment-13189509 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

Just following up on the per-segment terms collection.  Join is going to be used as a filter in most cases (?).  Filters can be applied per-segment (unlike scoring queries).  So it seems possible to avoid the BRH creation by using the DTI?
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189375#comment-13189375 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

I was reviewing this issue to use where Solr's join implementation may not be the right choice.

In this Lucene Join implementation, a new BytesRefHash is built per query (and cannot be reused).  This could generate a fair amount of garbage quickly.  

Also the sort compare using BRH is per byte (not as cheap as an ord compare).  We can probably use DocTermsIndex to replace the use of BytesRefHash by comparing DTI's ords.  Then we are saving off the bytes into BRH per query, and the comparison would be faster.

Additionally, for a join with many terms, the number of postings could become a factor in performance.  Because we are not caching bitsets like Solr does, it seems like an excellent occasion for a faster less-compressed codec.

Further, to save on term seeking, if the term state was cached (eg, the file pointers into the posting list), the iteration on the terms dict would be removed.

Granted all this requires more RAM, however in many cases (eg, mine), that would be acceptable.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186528#comment-13186528 ] 

Martijn van Groningen edited comment on LUCENE-3602 at 1/15/12 4:20 PM:
------------------------------------------------------------------------

Attached new patch.
* Extract collector from JoinUtil and named it TermsCollector.
* TermsCollector uses BytesRefHash.
* Renamed TermSetQuery to TermsQuery.
* TermsQuery uses sorted ByteRefs[]
* TermsCollector has two concrete impl. One for single value per field and one for multiple values per field.
* Moved the classes for query time joining under the package o.a.l.search.join.query
* Added documentation.

The JoinUtil is now just a convience class.

Should we add the existing block joining classes under o.a.l.search.join.index or o.a.l.search.join.block?

I think this patch is ready to get committed.
                
      was (Author: martijn.v.groningen):
    Attached new patch.
* Extract collector from JoinUtil and named it TermsCollector.
* TermsCollector uses BytesRefHash.
* Renamed TermSetQuery to TermsQuery.
* TermsQuery uses sorted ByteRes[]
* TermsCollector has two concrete impl. One for single value per field and one for multiple values per field.
* Moved the classes for query time joining under the package o.a.l.search.join.query
* Added documentation.

The JoinUtil is now just a convience class.

Should we add the existing block joining classes under o.a.l.search.join.index or o.a.l.search.join.block?

I think this patch is ready to get committed.
                  
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189501#comment-13189501 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

{quote}The terms collected in the first phase are from many segments{quote}

Why is that necessary?

{quote}Caching can be improved in the second phase as you described, by saving a bitset per fromTerm?{quote}

Possibly, only for terms with a high number of documents.  Or we can use a faster to decode (less compressed) posting codec.

bq. The JoinUtil is between 2 till 3 times faster than Solr's JoinQuery with this data set on my dev machine

Interesting, thanks for sharing.


                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185714#comment-13185714 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Oh yeah MTQ already defaults to that rewrite method :)

Really, since you know the exact number of terms up front... you could specialize it to constant BQ or filter yourself... but that can be later.

I don't think you should be calling TreeSet.higher on every term check -- that's a relatively costly binary search through the TreeSet each time.  Really you should be able to just walk through (with your own upto++) the pre-sorted BytesRef[] you pulled from the BytesRefHash.  Then we should retest seek vs next...
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167493#comment-13167493 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Patch looks good!

  * I like the test...

  * Maybe rename actualQuery to fromQuery?  (So it's clear that
    JoinQuery runs fromQuery using fromSearcher, joining on
    fromSearcher.fromField to toSearcher.toField).

  * Why preComputedFromDocs...?  Like if you were to cache something,
    wouldn't you want cache the toSearcher's bitset instead?

  * Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult, to
    contrast it w/ the per-segment scoring?  And add a comment
    explaining that we compute it once (on first segment) and all
    later segments then reuse it?

  * I wonder if we could make this a Filter instead, somehow?  Ie, at
    its core it converts a top-level bitset in the fromSearcher doc
    space into the joined bitset in the toSearcher doc space.  It
    could even maybe just be a static method taking in fromBitset and
    returning toBitset, which could operate per-segment on the
    toSearcher side?  (Separately: I wonder if JoinQuery should do
    something with the scores of the fromQuery....?  Not right now but
    maybe later...).

  * Why does the JoinQuery javadoc say "The downside of this
    is that in a sharded environment not all documents might get
    joined / linked." as a downside to the top-level approach?  Maybe
    reword that to state that all joined to/from docs must reside in
    the same shard?  In theory we could (later) make a shard friendly
    approach?  Eg, first pass builds up all unique Terms in the
    fromSearcher.fromField for docs matching the query (across all
    shards) and 2nd pass is basically a TermFilter on those...

  * Not sure it matters, but... including the preComputedFromDocs in
    hashCode/equals is quite costly (it goes bit by bit...).  Maybe it
    shouldn't be included, since it contains details about the
    particular searcher that query had been run against?  (In theory
    Query instances are searcher independent.)

In general I think this approach is somewhat inefficient, because it
always iterates over every possible term in fromSearcher.fromField,
checking the docs for each to see if there is a match in the query.
Ie, it's like FieldCache, in that it un-inverts, but it's uninverting
on every query.

I wonder if we could DocTermOrds instead?  (Or,
FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know fromSearcher.fromField is
single-valued).  This way we uninvert once (on init), and then doing
the join should be much faster since for each fromDocID we can lookup
the term(s) to join on.

Likewise on the toSearcher side, if we had doc <-> ord/term loaded we
could do the forward (term -> ord) lookup quickly (in memory binary
search).

But then this will obviously use RAM... so we should have the choice
(and start w/ the current patch!).

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189469#comment-13189469 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

I'm not sure how you plan to sort by DTI ords. The terms collected in the first phase are from many segments. The ords from DTI are only valid inside a segment. You can create a toplevel DTI but that is expensive... 

Currently caching is minimal and can be improved at the cost of more RAM. The TermsCollector caches the from terms via DocTerms in the FC (per segment). Caching can be improved in the second phase as you described, by saving a bitset per fromTerm?. But I think we first need to tackle how bitsets are cached in general. Solr caches (which the Solr JoinQuery uses) are top level (one commit and you lose it all). I'm unsure how to cache the posting list file pointers with the current Lucene api... I think this (joining) caching should be addressed in a new issue.

Performance of the JoinUtil looks actual quite good from what I have measured. I have a test data set containing 100000 products and 100 offers per product (10100000 docs in total). The JoinUtil is between 2 till 3 times faster than Solr's JoinQuery with this data set on my dev machine.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182269#comment-13182269 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Wow new patch is tiny -- just 2 static methods!

Right now you do 3 passes -- 1st pass records fromSearcher's docIDs
matching fromQuery; 2nd pass translates those matching docIDs into
the joinable terms in fromSearcher.fromField; 3rd pass then records
toSearcher docIDs matching those terms in toField.

But I think the first 2 passes could be combined?  Ie, as you collect
each hit from fromQuery, instead of recording the docID, go and look up
the term in fromField for that doc and save it away?  Then you don't
need to save the fromSearcher docIDs?  (3rd pass would then be the
same).

Also, instead of making a toplevel bit set as the return
result... could it be an ordinary filter?  Then the 3rd pass would be
implemented in Filter.getDocIDSet, and the Filter instance would hold
onto these terms computed by the combined 1st/2nd pass?

I think this is a great step forward over previous patch... so tiny
too :)

The 1st/2nd pass would have "expected" cost, ie on the order of how
many hits matched in fromQuery.  But the 3rd pass has a high cost even
for tiny queries since it visits every doc, checking whether its terms
are in the set.  We might be able to improve on that somehow... eg if
the number of terms is small, really you want to invert that process
(ie visit the postings and gather all matching docs), either with an
OR query or with TermsFilter (in modules/queries)?  Hmm, in fact,
isn't this just a MultiTermQuery?  We can then use auto rewrite mode
to rewrite as filter or small BooleanQuery?

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Fixed the YES_AND_SEEK issue. In the previously described case it uses YES_AND_SEEK now without seeking to the current term (that triggered the assertion error).
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159006#comment-13159006 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

Awesome to finally bring JoinQuery to pure Lucene apps!

Can we cut back to normal ctor (not builder API) to create the
JoinQuery?  One can always create a builder API layer on top if
necessary.

How does the preComputedFromDocs work?  It's not per-segment?  Like
it's a bitset across entire toplevel doc space?

Hmm we are also using MultiFields.getLiveDocs, which is quite slow to
use (must do binary search on each doc lookup).

I wonder if we can make this work per-segment... but that can be a 2nd
phase.

I think you can use seekExact instead of seekCeil?  Better
performance...

What is the AdjustedDISI for (and when would Weight.scorer get a
top-level context...)?

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached a new patch. Changes:
* Everything happens now in two passes.
* The second pass is in the custom MTQ impl.
* JoinUtil contains one method that returns a MTQ impl.
* Reused the tests from previous patch for the JoinUtil.

bq. But they can always choose the rewriteMode to not be AUTO and explicitly choose filter if they want this.
Yes makes sense. First time I used the MTQ directly. So feedback on my on MTQ impl is welcome :)

I think that this patch can easily be backported to 3x.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Updated patch.
* Removed builder api.
* Updated to latest Lucene api changes.
* Use seekExact instead of seekCeil method.
* Added random test.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached new patch.
* Extract collector from JoinUtil and named it TermsCollector.
* TermsCollector uses BytesRefHash.
* Renamed TermSetQuery to TermsQuery.
* TermsQuery uses sorted ByteRes[]
* TermsCollector has two concrete impl. One for single value per field and one for multiple values per field.
* Moved the classes for query time joining under the package o.a.l.search.join.query
* Added documentation.

The JoinUtil is now just a convience class.

Should we add the existing block joining classes under o.a.l.search.join.index or o.a.l.search.join.block?

I think this patch is ready to get committed.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187370#comment-13187370 ] 

Jason Rutherglen commented on LUCENE-3602:
------------------------------------------

Sweet!  How join would work in distributed mode, that would be very useful for BigData projects.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached a new patch with a completely different joining implementation. Basically the joining happens inside a static method, works per segment and uses FieldCache.

This is just a small try-out and is lacking any tests.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186926#comment-13186926 ] 

Robert Muir commented on LUCENE-3602:
-------------------------------------

Hi Martijn: the patch is looking really nice! 

Just a trivial thing, can we swap ctor arguments of
{code}
public TermsQuery(BytesRef[] terms, String field) {
{code}

to

{code}
public TermsQuery(String field, BytesRef[] terms) {
{code}

Most of the apis on indexreader etc take field,term so I think it looks more consistent.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182329#comment-13182329 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

bq. But I think the first 2 passes could be combined?
That idea also crossed my mind. We should definitely do this.

bq Also, instead of making a toplevel bit set as the return result... could it be an ordinary filter? 
I like that approach. It can just return a TermsFilter instance, right?

Maybe we should have to variants one returns a MTQ and one a TermsFilter? So users can choose whether to want a filter or a query.

                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen resolved LUCENE-3602.
-------------------------------------------

    Resolution: Fixed
    
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182332#comment-13182332 ] 

Robert Muir commented on LUCENE-3602:
-------------------------------------

{quote}
Maybe we should have to variants one returns a MTQ and one a TermsFilter? So users can choose whether to want a filter or a query.
{quote}

But they can always choose the rewriteMode to not be AUTO and explicitly choose filter if they want this.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187276#comment-13187276 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

+1 it looks great!  Good work :)
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167686#comment-13167686 ] 

Michael McCandless commented on LUCENE-3602:
--------------------------------------------

{quote}
This is in the case if your from query was cached and your toSearch's
bitset isn't, which is a likely scenario.
{quote}

Hmm can you describe this?  You mean the app sometimes actually uses the fromSearcher.fromQuery's results, directly, without joining?

{quote}
It just matches docs from one side to the to side. That is all... So static method / filter should be able to do the job.
I'm not sure, but if it is a query it might be able to one day encapsulate the joining in the Lucene query language?
{quote}

Yeah... the core API is really the join method, to translate top-level docIDs in fromSearcher over to toSearcher's top-level docIDs.

The AdjustedDISI (maybe rename SliceDISI?  SubReaderDISI?  ie, something to indicate it "slices" a sub-reader's portion of the top-level docID space) can then be used to translate back into a per-segment Filter.

I think it would be cleaner as a Filter...?  This is actually similar to DuplicateFilter, which also must operate on top-level docIDs (since dups can happen in different segments).

bq.  Would be nice if the user can choose between a more ram but faster variant and a less ram but slower variant.

Yeah I agree... but what worries me is just how slow this non-RAM version will be.  Ie, it must do the full join and uninvert every time; so even if your fromQuery only matches a tiny number of docs... you pay massive cost of the full join.  Even better than using FC/DV/DTO to map docID -> term(s) per query, we could hold in RAM the join result itself (docID -> docID(s)) in some form, then the query just directly maps the docIDs w/o having to lookup terms again.

Stepping back a bit... do we know how this impl compares to how ElasticSearch does joins?  And to how Solr does...?
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Attached initial version of the JoinQuery which based on the one in Solr.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602.patch

Oops... I found out that the assertion in FilteredTermsEnum line 207 was failing. Before I added the previous patch I ran the tests from my IDE and assertions weren't enabled...

In some cases I was seeking backwards which isn't good. The attached patch fixes this issue. All the tests in the join module pass now.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3602:
------------------------------------------

    Attachment: LUCENE-3602-3x.patch

Attached initial patch for 3x branch.
* I replaced the BytesRefHash with a Set for now. FC in 3x is String based.
* TermsQuery doesn't seek, but does a binary search to find the index in the String[] (created from the Set).... Not ideal. So we need some seeking.

I ran some quick tests and performance is 3 till 4 times worse then the JoinUtil committed to trunk.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602-3x.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189316#comment-13189316 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

bq. Sweet! How join would work in distributed mode, that would be very useful for BigData projects.
The join is actually executed in a two pass search. During the first pass search all the terms are gathered for the matching documents based on the fromQuery. In the second pass search all documents are collected that match with the gather terms for a specific field. To only way I currently see how this can work in a distributed environment is that all machines in the cluster execute the first pass search and then copy the collected terms between machines. After this is done each machine can execute the second pass. If your data allows it, you can partition data in your cluster this allows you to skip the copying of terms. 

Currently the api is just one static method and assumes that the joining happens locally. I think we need to have two more methods. One method that returns the first pass terms and one method that constructs a query based on terms from the first pass.

Robert: Yes I see that 3.x MTQ isn't as great as MTQ in trunk. Maybe we need a different approach (not use MTQ)? The api is clean for users, and allows us to do joining different in 3x. I'll start backporting and see how well it goes.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167571#comment-13167571 ] 

Martijn van Groningen commented on LUCENE-3602:
-----------------------------------------------

bq. Maybe rename actualQuery to fromQuery?
Yes, fromQuery makes more sense than actualQuery.

{quote}
Why preComputedFromDocs...? Like if you were to cache something,
wouldn't you want cache the toSearcher's bitset instead?
{quote}
This is in the case if your from query was cached and your toSearch's
bitset isn't, which is a likely scenario.
But caching the toSearcher's bitset is better off course when
possible. But this should be happen outside the JoinQuery, right?

bq. Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult,
I agree a much more descriptive name.

{quote}
I wonder if we could make this a Filter instead, somehow? Ie, at
its core it converts a top-level bitset in the fromSearcher doc
space into the joined bitset in the toSearcher doc space. It
could even maybe just be a static method taking in fromBitset and
returning toBitset, which could operate per-segment on the
toSearcher side? (Separately: I wonder if JoinQuery should do
something with the scores of the fromQuery....? Not right now but
maybe later...).
{quote}
It just matches docs from one side to the to side. That is all... So static method / filter should be able to do the job.
I'm not sure, but if it is a query it might be able to one day encapsulate the joining in the Lucene query language?

bq. Maybe reword that to state that all joined to/from docs must reside in the same shard?
+1

{quote}
I wonder if we could DocTermOrds instead? (Or,
FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know
fromSearcher.fromField is
single-valued). This way we uninvert once (on init), and then doing
the join should be much faster since for each fromDocID we can lookup
the term(s) to join on.
{quote}
I really like that idea! This already crossed my mind a few days ago
as an improvement to speedup the joining. Would be nice if the user can 
choose between a more ram but faster variant and a less ram but slower variant.
I think we can just make two concrete JoinQuery impl that both have a different
joinResult(...) impl.
                
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3602) Add join query to Lucene

Posted by "Martijn van Groningen (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182329#comment-13182329 ] 

Martijn van Groningen edited comment on LUCENE-3602 at 1/13/12 10:00 AM:
-------------------------------------------------------------------------

bq. But I think the first 2 passes could be combined?
That idea also crossed my mind. We should definitely do this.

bq. Also, instead of making a toplevel bit set as the return result... could it be an ordinary filter? 
I like that approach. It can just return a TermsFilter instance, right?

Maybe we should have to variants one returns a MTQ and one a TermsFilter? So users can choose whether to want a filter or a query.

                
      was (Author: martijn.v.groningen):
    bq. But I think the first 2 passes could be combined?
That idea also crossed my mind. We should definitely do this.

bq Also, instead of making a toplevel bit set as the return result... could it be an ordinary filter? 
I like that approach. It can just return a TermsFilter instance, right?

Maybe we should have to variants one returns a MTQ and one a TermsFilter? So users can choose whether to want a filter or a query.

                  
> Add join query to Lucene
> ------------------------
>
>                 Key: LUCENE-3602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3602
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/join
>            Reporter: Martijn van Groningen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch, LUCENE-3602.patch
>
>
> Solr has (psuedo) join query for a while now. I think this should also be available in Lucene.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org