Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/04/02 14:29:30 UTC

How to use concurrency efficiently

Hello!

I have a ~20 GB index and am trying to search it concurrently.
The index has 16 segments, and I run SpanQuery.getSpans() on each segment concurrently.
I see only a small performance improvement from searching concurrently. I suppose the reason is that the segment sizes are very non-uniform (3 segments have ~20 000 docs each, and the others have fewer than 1 000 each).
How can I make more uniformly sized segments (I currently just use writer.forceMerge(16)), and are multiple index segments the most important factor in Lucene concurrency?
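For reference, the per-segment fan-out described above can be sketched roughly as follows. This is a sketch against the Lucene 4.x API of the time, not the poster's actual code: the reader and query are assumed to be an already-open DirectoryReader and a prebuilt SpanQuery, and the pool size is an arbitrary choice.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public final class PerSegmentSpanSearch {
    // Count text hits (spans) with one task per index segment.
    public static long countSpans(DirectoryReader reader, final SpanQuery query)
            throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Long>> perSegment = new ArrayList<Future<Long>>();
        for (final AtomicReaderContext leaf : reader.leaves()) {
            perSegment.add(pool.submit(new Callable<Long>() {
                public Long call() throws IOException {
                    Spans spans = query.getSpans(
                        leaf, leaf.reader().getLiveDocs(),
                        new HashMap<Term, TermContext>());
                    long hits = 0;
                    while (spans.next()) hits++; // one hit per span occurrence
                    return hits;
                }
            }));
        }
        long total = 0;
        for (Future<Long> f : perSegment) total += f.get(); // merge step
        pool.shutdown();
        return total;
    }
}
```

Note that the final summation loop runs single-threaded, so with very uneven segment sizes the largest segment still dominates the wall-clock time.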

-- 
Best Regards,
Igor

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to use concurrency efficiently

Posted by Paul Bell <ar...@gmail.com>.
All,

Sorry, but I inadvertently put my post re MultiFieldQueryParser in the wrong
thread (wrong subject via cut and paste).

Igor, thank you for the reply. I will look into what you suggest.

-Paul


On Wed, Apr 3, 2013 at 6:58 AM, Igor Shalyminov
<is...@yandex-team.ru>wrote:

> I personally use SpanNearQuery (span positions are always needed), and for
> different fields I use FieldMaskingSpanQuery class.
> I just choose one field name and then mask each SpanTermQuery's real field
> name with this field via wrapper.
>
> Maybe it can help.
>
> --
> Igor
>
> 03.04.2013, 06:59, "Paul" <ar...@gmail.com>:
> > Hi,
> >
> > I've experimented a bit with MultiFieldQueryParser (
> http://lucene.apache.org/core/4_2_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html
> )
> >
> > But it seems to search for each of a query's terms in each field
> specified in the constructor. So, as the doc says, if you query on two
> terms against two fields, it will search for each term in each field.
> >
> > What's the best way to construct a search for, say, two terms where one
> should be looked for in field1 and the other in field2? Can this be done by
> a BooleanQuery that ANDs two TermQuerys?
> >
> > I read something about the abstract class MultiTermQuery, but I don't
> really understand whether or not it would help with this problem.
> >
> > Thank you.
> >
> > -Paul

Re: How to use concurrency efficiently

Posted by Igor Shalyminov <is...@yandex-team.ru>.
I personally use SpanNearQuery (span positions are always needed), and for different fields I use the FieldMaskingSpanQuery class.
I just choose one field name and then mask each SpanTermQuery's real field name with this field via the wrapper.
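The masking trick described above might look like this, with hypothetical field and term names (Lucene 4.x API):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.FieldMaskingSpanQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public final class FieldMaskingExample {
    // Match "john" in field "first" right before "smith" in field "last".
    // SpanNearQuery requires all clauses to be on one field, so the
    // "first" clause is masked to report "last" as its field.
    public static SpanQuery build() {
        SpanQuery first = new SpanTermQuery(new Term("first", "john"));
        SpanQuery last = new SpanTermQuery(new Term("last", "smith"));
        SpanQuery maskedFirst = new FieldMaskingSpanQuery(first, "last");
        return new SpanNearQuery(
            new SpanQuery[] { maskedFirst, last }, 0, true); // slop 0, in order
    }
}
```

The caveat from the FieldMaskingSpanQuery javadoc applies: the masked fields should have congruent position increments, or the span distances become meaningless.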

Maybe it can help.

-- 
Igor

03.04.2013, 06:59, "Paul" <ar...@gmail.com>:
> Hi,
>
> I've experimented a bit with MultiFieldQueryParser (http://lucene.apache.org/core/4_2_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html)
>
> But it seems to search for each of a query's terms in each field specified in the constructor. So, as the doc says, if you query on two terms against two fields, it will search for each term in each field.
>
> What's the best way to construct a search for, say, two terms where one should be looked for in field1 and the other in field2? Can this be done by a BooleanQuery that ANDs two TermQuerys?
>
> I read something about the abstract class MultiTermQuery, but I don't really understand whether or not it would help with this problem.
>
> Thank you.
>
> -Paul


Re: How to use concurrency efficiently

Posted by Paul <ar...@gmail.com>.
Hi, 

I've experimented a bit with MultiFieldQueryParser (http://lucene.apache.org/core/4_2_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html)

But it seems to search for each of a query's terms in each field specified in the constructor. So, as the doc says, if you query on two terms against two fields, it will search for each term in each field.

What's the best way to construct a search for, say, two terms where one should be looked for in field1 and the other in field2? Can this be done by a BooleanQuery that ANDs two TermQuerys?
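The combination asked about here would look roughly like this, with hypothetical field and term names (Lucene 4.x API):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public final class TwoFieldAndExample {
    // Require "alpha" in field1 AND "beta" in field2.
    public static BooleanQuery build() {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("field1", "alpha")),
                  BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("field2", "beta")),
                  BooleanClause.Occur.MUST);
        return query;
    }
}
```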

I read something about the abstract class MultiTermQuery, but I don't really understand whether or not it would help with this problem.

Thank you.

-Paul


RE: How to use concurrency efficiently

Posted by Uwe Schindler <uw...@thetaphi.de>.
If you are using MMapDirectory (the default on 64-bit platforms), the index files are already in the filesystem cache and are directly accessible to the IndexReader, just like RAM. There is no need to cache them separately.
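A minimal sketch of opening an index this way (Lucene 4.x API; the index path is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public final class OpenWithMMap {
    public static DirectoryReader open(File indexDir) throws IOException {
        // FSDirectory.open() already picks MMapDirectory on 64-bit JVMs;
        // constructing it explicitly just makes the choice visible.
        Directory dir = new MMapDirectory(indexDir);
        return DirectoryReader.open(dir);
    }
}
```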

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Igor Shalyminov [mailto:ishalyminov@yandex-team.ru]
> Sent: Tuesday, April 02, 2013 9:58 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to use concurrency efficiently
> 
> These are not document hits but text hits (to be more specific, spans).
> For the search result it is necessary to have the precise number of document
> and text hits and a relatively small number of matched text snippets.
> 
> I've tried several approaches to optimize the search algorithm but they didn't
> help - for the specific types of queries there is indeed a great amount of data
> to be retrieved from the index.
> At the moment I'm thinking about in-RAM caching of posting lists. Is it
> possible in Lucene?
> 
> --
> Igor
> 
> 02.04.2013, 20:44, "Adrien Grand" <jp...@gmail.com>:
> > On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov
> > <is...@yandex-team.ru> wrote:
> >
> >>  Yes, the number of documents is not too large (about 90 000), but the
> queries are very hard. Although they're just boolean, a typical query can
> produce a result with tens of millions of hits.
> >
> > How can there be tens of millions of hits with only 90000 docs?
> >
> >>  Run single-threaded, such a query takes ~20 seconds, which is too slow.
> Therefore, multithreading is vital for this task.
> >
> > Indeed, that's super slow. Multithreading could help a little, but
> > maybe there is something to do to better index your data so that
> > queries get faster?
> >
> > --
> > Adrien
> >




Re: How to use concurrency efficiently

Posted by Igor Shalyminov <is...@yandex-team.ru>.
These are not document hits but text hits (to be more specific, spans).
For the search result it is necessary to have the precise number of document and text hits and a relatively small number of matched text snippets.

I've tried several approaches to optimize the search algorithm but they didn't help - for the specific types of queries there is indeed a great amount of data to be retrieved from the index.
At the moment I'm thinking about in-RAM caching of posting lists. Is it possible in Lucene?

-- 
Igor

02.04.2013, 20:44, "Adrien Grand" <jp...@gmail.com>:
> On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  Yes, the number of documents is not too large (about 90 000), but the queries are very hard. Although they're just boolean, a typical query can produce a result with tens of millions of hits.
>
> How can there be tens of millions of hits with only 90000 docs?
>
>>  Run single-threaded, such a query takes ~20 seconds, which is too slow. Therefore, multithreading is vital for this task.
>
> Indeed, that's super slow. Multithreading could help a little, but
> maybe there is something to do to better index your data so that
> queries get faster?
>
> --
> Adrien
>


Re: How to use concurrency efficiently

Posted by Adrien Grand <jp...@gmail.com>.
On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov
<is...@yandex-team.ru> wrote:
> Yes, the number of documents is not too large (about 90 000), but the queries are very hard. Although they're just boolean, a typical query can produce a result with tens of millions of hits.

How can there be tens of millions of hits with only 90000 docs?

> Run single-threaded, such a query takes ~20 seconds, which is too slow. Therefore, multithreading is vital for this task.

Indeed, that's super slow. Multithreading could help a little, but
maybe there is something to do to better index your data so that
queries get faster?

--
Adrien



Re: How to use concurrency efficiently

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Yes, the number of documents is not too large (about 90 000), but the queries are very hard. Although they're just boolean, a typical query can produce a result with tens of millions of hits.
Run single-threaded, such a query takes ~20 seconds, which is too slow. Therefore, multithreading is vital for this task.

As you mentioned, merges are the source of the non-uniform segment sizes. Since my index is fully static (whenever I need to re-index, I can do it from scratch), I'm going to try NoMergePolicy with some reasonable maximum segment size.
Any other multithreading caveats are highly welcome.
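A sketch of that plan under these assumptions (Lucene 4.2 API; the 6 000-doc flush threshold is an arbitrary example, not a recommendation):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public final class UniformSegmentsWriter {
    // With NoMergePolicy, every flush becomes a permanent segment, so
    // flushing every 6 000 docs yields ~15 equal segments for 90 000 docs.
    public static IndexWriter create(Directory dir, Analyzer analyzer)
            throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        cfg.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
        cfg.setMaxBufferedDocs(6000);
        // Disable the RAM-size trigger so doc count alone decides flushes.
        cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
        return new IndexWriter(dir, cfg);
    }
}
```

Note that equal doc counts only approximate equal work: segments with longer documents will still take longer to scan.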

-- 
Best Regards,
Igor

02.04.2013, 18:07, "Adrien Grand" <jp...@gmail.com>:
> On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  Hello!
>
> Hi Igor,
>
>>  I have a ~20 GB index and am trying to search it concurrently.
>>  The index has 16 segments, and I run SpanQuery.getSpans() on each segment concurrently.
>>  I see only a small performance improvement from searching concurrently. I suppose the reason is that the segment sizes are very non-uniform (3 segments have ~20 000 docs each, and the others have fewer than 1 000 each).
>>  How can I make more uniformly sized segments (I currently just use writer.forceMerge(16)), and are multiple index segments the most important factor in Lucene concurrency?
>
> Segments have non-uniform sizes by design. A segment is generated
> every time a flush happens (when the RAM buffer is full or when you
> explicitly call commit). When there are too many segments, Lucene
> merges some of them while new segments keep being generated as you add
> data. So the "flush" segments will always be small, while segments
> resulting from a merge will be much larger, since they contain data
> from several other segments.
>
> Even if segments are collected concurrently, IndexSearcher needs to
> merge the results of collecting each segment at the end. Since
> your segments are very small (20 000 docs), maybe the cost of
> initialization/merging is not negligible compared to single-segment
> collection.
>
> --
> Adrien
>


Re: How to use concurrency efficiently

Posted by Adrien Grand <jp...@gmail.com>.
On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov
<is...@yandex-team.ru> wrote:
> Hello!

Hi Igor,

> I have a ~20 GB index and am trying to search it concurrently.
> The index has 16 segments, and I run SpanQuery.getSpans() on each segment concurrently.
> I see only a small performance improvement from searching concurrently. I suppose the reason is that the segment sizes are very non-uniform (3 segments have ~20 000 docs each, and the others have fewer than 1 000 each).
> How can I make more uniformly sized segments (I currently just use writer.forceMerge(16)), and are multiple index segments the most important factor in Lucene concurrency?

Segments have non-uniform sizes by design. A segment is generated
every time a flush happens (when the RAM buffer is full or when you
explicitly call commit). When there are too many segments, Lucene
merges some of them while new segments keep being generated as you add
data. So the "flush" segments will always be small, while segments
resulting from a merge will be much larger, since they contain data
from several other segments.

Even if segments are collected concurrently, IndexSearcher needs to
merge the results of collecting each segment at the end. Since
your segments are very small (20 000 docs), maybe the cost of
initialization/merging is not negligible compared to single-segment
collection.
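That per-segment fan-out and merge can also be left to Lucene itself: IndexSearcher accepts an ExecutorService and then collects segments concurrently, merging the per-segment results internally. A sketch (the pool size is arbitrary, and in real code the pool would usually be shared and shut down by its owner):

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public final class ParallelSearchExample {
    public static TopDocs search(IndexReader reader, Query query)
            throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Segments are searched concurrently; IndexSearcher merges the
        // per-segment top hits into one TopDocs.
        IndexSearcher searcher = new IndexSearcher(reader, pool);
        TopDocs top = searcher.search(query, 100);
        pool.shutdown();
        return top;
    }
}
```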

--
Adrien
