You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Peter Keegan <pe...@gmail.com> on 2007/08/13 20:33:37 UTC

SpanQuery and database join

I've been experimenting with using SpanQuery to perform what is essentially
a limited type of database 'join'. Each document in the index contains 1 or
more 'rows' of meta data from another 'table'. The meta data are simple
tokens representing a column name/value pair ( e.g. color$red or
location$123).  Each row is represented by a span with a maximum token
length equal to the maximum number of meta data columns. If a column has
multiple values, they are all indexed at the same position ( e.g. color$red,
color$blue). All rows are added to a single field. The spans are 'separated'
from each other by introducing a position gap between them via '
Analyzer.getPositionIncrementGap'. This gap should be greater than the
number of columns in each span.

At query time, a SpanNearQuery is constructed to represent the meta data to
join. The 'slop' value is set to the maximum number of meta data columns
(minus 1). Using a simple Antlr parser, boolean span queries with AND, OR,
NOT can be constructed fairly easily. The SpanQuery is And'd to the main
query to build the final query.

This approach is flexible and pretty efficient because no stored fields or
external data are accessed at query time. Span queries are more expensive
compared than other queries, though. We measure performance via throughput
(as opposed to the response time for a single query), and the addition of a
SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered
spans. Still, this may be acceptable for some applications, especially if
spans are not used on every query.

I thought this might interest some of you.

Peter

Re: SpanQuery and database join

Posted by Peter Keegan <pe...@gmail.com>.

I added this under Use Cases. Thanks for the suggestion.

Peter


On 8/13/07, Grant Ingersoll <gs...@apache.org> wrote:
>
> There is also a Use Cases item on the Wiki...
>
> On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:
>
> > I suppose it could go under performance or HowTo/Interesting uses of
> > SpanQuery.
> >
> > Peter
> >
> > On 8/13/07, Erick Erickson <er...@gmail.com> wrote:
> >>
> >> Thanks for writing this up. Do you think this is an appropriate
> >> subject
> >> for the Wiki performance page?
> >>
> >> Erick
> >>
> >> On 8/13/07, Peter Keegan <pe...@gmail.com> wrote:
> >>>
> >>> I've been experimenting with using SpanQuery to perform what is
> >>> essentially
> >>> a limited type of database 'join'. Each document in the index
> >>> contains 1
> >>> or
> >>> more 'rows' of meta data from another 'table'. The meta data are
> >>> simple
> >>> tokens representing a column name/value pair ( e.g. color$red or
> >>> location$123).  Each row is represented by a span with a maximum
> >>> token
> >>> length equal to the maximum number of meta data columns. If a
> >>> column has
> >>> multiple values, they are all indexed at the same position ( e.g.
> >>> color$red,
> >>> color$blue). All rows are added to a single field. The spans are
> >>> 'separated'
> >>> from each other by introducing a position gap between them via '
> >>> Analyzer.getPositionIncrementGap'. This gap should be greater
> >>> than the
> >>> number of columns in each span.
> >>>
> >>> At query time, a SpanNearQuery is constructed to represent the
> >>> meta data
> >>> to
> >>> join. The 'slop' value is set to the maximum number of meta data
> >>> columns
> >>> (minus 1). Using a simple Antlr parser, boolean span queries with
> >>> AND,
> >> OR,
> >>> NOT can be constructed fairly easily. The SpanQuery is And'd to
> >>> the main
> >>> query to build the final query.
> >>>
> >>> This approach is flexible and pretty efficient because no stored
> >>> fields
> >> or
> >>> external data are accessed at query time. Span queries are more
> >> expensive
> >>> compared than other queries, though. We measure performance via
> >> throughput
> >>> (as opposed to the response time for a single query), and the
> >>> addition
> >> of
> >>> a
> >>> SpanQuery reduced throughput by 5X for ordered spans and 10X for
> >> unordered
> >>> spans. Still, this may be acceptable for some applications,
> >>> especially
> >> if
> >>> spans are not used on every query.
> >>>
> >>> I thought this might interest some of you.
> >>>
> >>> Peter
> >>>
> >>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: SpanQuery and database join

Posted by Grant Ingersoll <gs...@apache.org>.

There is also a Use Cases item on the Wiki...

On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:

> I suppose it could go under performance or HowTo/Interesting uses of
> SpanQuery.
>
> Peter
>
> On 8/13/07, Erick Erickson <er...@gmail.com> wrote:
>>
>> Thanks for writing this up. Do you think this is an appropriate  
>> subject
>> for the Wiki performance page?
>>
>> Erick
>>
>> On 8/13/07, Peter Keegan <pe...@gmail.com> wrote:
>>>
>>> I've been experimenting with using SpanQuery to perform what is
>>> essentially
>>> a limited type of database 'join'. Each document in the index  
>>> contains 1
>>> or
>>> more 'rows' of meta data from another 'table'. The meta data are  
>>> simple
>>> tokens representing a column name/value pair ( e.g. color$red or
>>> location$123).  Each row is represented by a span with a maximum  
>>> token
>>> length equal to the maximum number of meta data columns. If a  
>>> column has
>>> multiple values, they are all indexed at the same position ( e.g.
>>> color$red,
>>> color$blue). All rows are added to a single field. The spans are
>>> 'separated'
>>> from each other by introducing a position gap between them via '
>>> Analyzer.getPositionIncrementGap'. This gap should be greater  
>>> than the
>>> number of columns in each span.
>>>
>>> At query time, a SpanNearQuery is constructed to represent the  
>>> meta data
>>> to
>>> join. The 'slop' value is set to the maximum number of meta data  
>>> columns
>>> (minus 1). Using a simple Antlr parser, boolean span queries with  
>>> AND,
>> OR,
>>> NOT can be constructed fairly easily. The SpanQuery is And'd to  
>>> the main
>>> query to build the final query.
>>>
>>> This approach is flexible and pretty efficient because no stored  
>>> fields
>> or
>>> external data are accessed at query time. Span queries are more
>> expensive
>>> compared than other queries, though. We measure performance via
>> throughput
>>> (as opposed to the response time for a single query), and the  
>>> addition
>> of
>>> a
>>> SpanQuery reduced throughput by 5X for ordered spans and 10X for
>> unordered
>>> spans. Still, this may be acceptable for some applications,  
>>> especially
>> if
>>> spans are not used on every query.
>>>
>>> I thought this might interest some of you.
>>>
>>> Peter
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SpanQuery and database join

Posted by Peter Keegan <pe...@gmail.com>.

I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.

Peter

On 8/13/07, Erick Erickson <er...@gmail.com> wrote:
>
> Thanks for writing this up. Do you think this is an appropriate subject
> for the Wiki performance page?
>
> Erick
>
> On 8/13/07, Peter Keegan <pe...@gmail.com> wrote:
> >
> > I've been experimenting with using SpanQuery to perform what is
> > essentially
> > a limited type of database 'join'. Each document in the index contains 1
> > or
> > more 'rows' of meta data from another 'table'. The meta data are simple
> > tokens representing a column name/value pair ( e.g. color$red or
> > location$123).  Each row is represented by a span with a maximum token
> > length equal to the maximum number of meta data columns. If a column has
> > multiple values, they are all indexed at the same position ( e.g.
> > color$red,
> > color$blue). All rows are added to a single field. The spans are
> > 'separated'
> > from each other by introducing a position gap between them via '
> > Analyzer.getPositionIncrementGap'. This gap should be greater than the
> > number of columns in each span.
> >
> > At query time, a SpanNearQuery is constructed to represent the meta data
> > to
> > join. The 'slop' value is set to the maximum number of meta data columns
> > (minus 1). Using a simple Antlr parser, boolean span queries with AND,
> OR,
> > NOT can be constructed fairly easily. The SpanQuery is And'd to the main
> > query to build the final query.
> >
> > This approach is flexible and pretty efficient because no stored fields
> or
> > external data are accessed at query time. Span queries are more
> expensive
> > compared than other queries, though. We measure performance via
> throughput
> > (as opposed to the response time for a single query), and the addition
> of
> > a
> > SpanQuery reduced throughput by 5X for ordered spans and 10X for
> unordered
> > spans. Still, this may be acceptable for some applications, especially
> if
> > spans are not used on every query.
> >
> > I thought this might interest some of you.
> >
> > Peter
> >
>

Re: SpanQuery and database join

Posted by Erick Erickson <er...@gmail.com>.

Thanks for writing this up. Do you think this is an appropriate subject
for the Wiki performance page?

Erick

On 8/13/07, Peter Keegan <pe...@gmail.com> wrote:
>
> I've been experimenting with using SpanQuery to perform what is
> essentially
> a limited type of database 'join'. Each document in the index contains 1
> or
> more 'rows' of meta data from another 'table'. The meta data are simple
> tokens representing a column name/value pair ( e.g. color$red or
> location$123).  Each row is represented by a span with a maximum token
> length equal to the maximum number of meta data columns. If a column has
> multiple values, they are all indexed at the same position ( e.g.
> color$red,
> color$blue). All rows are added to a single field. The spans are
> 'separated'
> from each other by introducing a position gap between them via '
> Analyzer.getPositionIncrementGap'. This gap should be greater than the
> number of columns in each span.
>
> At query time, a SpanNearQuery is constructed to represent the meta data
> to
> join. The 'slop' value is set to the maximum number of meta data columns
> (minus 1). Using a simple Antlr parser, boolean span queries with AND, OR,
> NOT can be constructed fairly easily. The SpanQuery is And'd to the main
> query to build the final query.
>
> This approach is flexible and pretty efficient because no stored fields or
> external data are accessed at query time. Span queries are more expensive
> compared than other queries, though. We measure performance via throughput
> (as opposed to the response time for a single query), and the addition of
> a
> SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered
> spans. Still, this may be acceptable for some applications, especially if
> spans are not used on every query.
>
> I thought this might interest some of you.
>
> Peter
>