Posted to dev@lucene.apache.org by "Erik Groeneveld (JIRA)" <ji...@apache.org> on 2013/10/19 18:53:42 UTC

[jira] [Comment Edited] (LUCENE-5291) Faster Query-Time Join

    [ https://issues.apache.org/jira/browse/LUCENE-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798879#comment-13798879 ] 

Erik Groeneveld edited comment on LUCENE-5291 at 10/19/13 4:53 PM:
-------------------------------------------------------------------

This patch does not patch anything in Lucene. It's just three classes that build on Lucene.  They live in src/org/meresco/lucene. If adopted, they could be moved to org/apache/lucene.

I used "diff -urN" instead of svn diff, since the code is in git, not in subversion.

The split between KeyCollector and CachingKeyCollector is not essential.  It only shows how simple the idea is, and how caching complicates things.

Intended usage.

EDIT

Indexing:
Create NumericDocValues for the fields you want to join on.  We translate URIs to ords using DirectoryTaxonomyWriter, but that's just one way of doing it; any scheme works as long as the numbers are small and monotonically increasing.
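
To illustrate what "small and monotonically increasing" keys could look like, here is a minimal plain-Java sketch. It is not code from the patch and it does not use DirectoryTaxonomyWriter or any Lucene API; the class KeyOrdinals and its methods are hypothetical names made up for this example. It only shows the idea of giving each distinct URI the next free integer, which is the value one would then store as the document's numeric key.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of the patch: maps each distinct URI to a small integer key. */
final class KeyOrdinals {
    private final Map<String, Integer> ords = new HashMap<>();

    /** Returns the ordinal already assigned to uri, or assigns the next one (0, 1, 2, ...). */
    int ordFor(String uri) {
        Integer existing = ords.get(uri);
        if (existing != null) {
            return existing;
        }
        int next = ords.size(); // next unused ordinal, so keys stay dense and increasing
        ords.put(uri, next);
        return next;
    }

    /** Number of distinct URIs seen so far. */
    int size() {
        return ords.size();
    }

    public static void main(String[] args) {
        KeyOrdinals ords = new KeyOrdinals();
        // The same URI always yields the same key; a new URI gets the next integer.
        System.out.println(ords.ordFor("http://example.org/a"));
        System.out.println(ords.ordFor("http://example.org/b"));
        System.out.println(ords.ordFor("http://example.org/a"));
    }
}
```

Because the keys are dense, a set of keys can later be held in a bit set whose size is bounded by the number of distinct URIs.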

Searching:
You first use CachingKeyCollector to collect keys from one index. Then you use CachingKeyCollector.getFilter() to filter keys in another index.  I went to some lengths to add documentation to the code, so I hope it is clear how it works.
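
The collect-then-filter flow can be sketched in plain Java, under some loud assumptions: this is not the code from the patch, it does not use Lucene, and KeyJoinSketch, collectKeys and filterByKeys are hypothetical names. Each "index" is modelled as an int array holding one integer key per document, standing in for a single-valued NumericDocValues field, and caching is left out entirely.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration of an integer-key join; not the patch's KeyCollector. */
final class KeyJoinSketch {

    /**
     * Collect step: for every matching document in the first index, record its key.
     * keysByDoc[doc] is the single integer key of document doc; hits are matching doc ids.
     */
    static BitSet collectKeys(int[] keysByDoc, int[] hits) {
        BitSet keys = new BitSet();
        for (int doc : hits) {
            keys.set(keysByDoc[doc]);
        }
        return keys;
    }

    /** Filter step: keep the documents of the second index whose key was collected. */
    static List<Integer> filterByKeys(int[] keysByDoc, BitSet keys) {
        List<Integer> result = new ArrayList<>();
        for (int doc = 0; doc < keysByDoc.length; doc++) {
            if (keys.get(keysByDoc[doc])) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] keysInFirstIndex = {5, 2, 7, 2};  // key per document in the first index
        int[] hitsInFirstIndex = {0, 3};        // documents matched by some query
        int[] keysInSecondIndex = {2, 9, 5, 7}; // key per document in the second index

        BitSet keys = collectKeys(keysInFirstIndex, hitsInFirstIndex);
        System.out.println(filterByKeys(keysInSecondIndex, keys));
    }
}
```

The bit set is why small keys matter in this sketch: its memory use grows with the largest key, not with the number of keys collected.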


was (Author: erik@seecr.nl):
This patch does not patch anything in Lucene. It's just three classes that apply Lucene.  They live in src/org/meresco/lucene. If adopted, they could be moved to org/apache/lucene.

I used "diff -urN" instead of svn diff, since the code is in git, not in subversion.

The split between KeyCollector and CachingKeyCollector is not essential.  It only shows how simple the idea is, and how caching complicates things.

Intended usage.

You first use CachingKeyCollector to collect keys from one index. Then you use CachingKeyCollector.getFilter() to filter keys in another index.  I went to some lengths to add documentation to the code, so I hope it is clear how it works.

> Faster Query-Time Join
> ----------------------
>
>                 Key: LUCENE-5291
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5291
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.5
>            Reporter: Erik Groeneveld
>            Priority: Minor
>              Labels: join, query
>         Attachments: LUCENE-5291.patch
>
>
> The current implementation of query-time join could be complemented with a much faster one, provided some choices can be made about what to join on.
> Since join is really a database concept, we found it quite natural to restrict the keys to single-valued integers.
> We found that if it is possible to use integer keys and single-valued key fields, the speed of join can be improved 50-fold. Proper caching speeds it up again, by about 20 times.
> I'd like to contribute our code if you agree that it is a useful contribution.  That probably depends on what you think of the choices we made about the keys, so that needs to be discussed first?


