Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/10/10 00:17:32 UTC

Lucene in-memory index

Hello!

I need to run an experiment: load the entire index into RAM and see how the search performance changes.
My index has TermVectors with payload and position info, StoredFields, and DocValues. It takes ~30GB on disk (the server has 48GB of RAM).

_indexDirectoryReader = DirectoryReader.open(new RAMDirectory(FSDirectory.open(new File(_indexDirectory)), IOContext.READONCE));

Is the line above all I have to do to achieve my goal?

And also:
- will all the data be loaded into RAM right after opening, or lazily during the reading stage?
- will the index data be stored in RAM in its on-disk format, or decompressed first?
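As a side note on the line above: RAMDirectory has no static open(File) method in the 4.x API; an on-disk index is pulled into RAM by copying another Directory. A minimal sketch (the index path is a placeholder):

```java
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class LoadIndexIntoRam {
    public static void main(String[] args) throws Exception {
        // RAMDirectory(Directory, IOContext) copies every index file into heap
        // memory up front, so the whole index must fit in the Java heap.
        RAMDirectory ramDir = new RAMDirectory(
                FSDirectory.open(new File("/path/to/index")), IOContext.READONCE);
        DirectoryReader reader = DirectoryReader.open(ramDir);
        System.out.println("docs in RAM-resident index: " + reader.maxDoc());
        reader.close();
        ramDir.close();
    }
}
```

Note that this copies every index file onto the Java heap up front, so the heap must be sized to hold the full ~30GB plus search-time overhead.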

-- 
Best Regards,
Igor

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Oct 18, 2013 at 1:19 PM, Igor Shalyminov
<is...@yandex-team.ru> wrote:

> OK, it turns out that DirectPostingsFormat is really an extreme thing: an 8GB index couldn't fit into a 20+GB Java heap.
> I wonder if there is a postings format that works from disk the standard way but uses no compression?

Yes, it's very RAM hungry ... it does no compression.

You could instantiate the current default postings format
(Lucene41PostingsFormat) with a higher acceptableOverheadRatio; this
would cause it to use less compression for faster decoding ... but I
strongly suspect your hotspot is the SpanNearQuery and not postings
decode.

Mike McCandless

http://blog.mikemccandless.com



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Hello!

OK, it turns out that DirectPostingsFormat is really an extreme thing: an 8GB index couldn't fit into a 20+GB Java heap.
I wonder if there is a postings format that works from disk the standard way but uses no compression?


-- 
Best Regards,
Igor

18.10.2013, 02:06, "Igor Shalyminov" <is...@yandex-team.ru>:
> Mike,
>
> For now I'm using just a SpanQuery over a ~600MB index segment single-threadedly (one segment - one thread, the complete setup is 30 segments with the total of 20GB).
>
> I'm trying to use Lucene for the morphologically annotated text corpus (namely, Russian National Corpus).
> The main query type in it is co-occurrence search with desired word morphological features and distance between tokens.
>
> In my test case I work with a single field - grammar (it is word-level - every word in the corpus has one). Full grammar annotation of a word is a set of atomic grammar features.
> For example, the verb "book" has in its grammar:
> - POS  tag (V);
> - time (pres);
>
> and the noun "book":
> - POS tag (N)
> - number (sg).
>
> In general one grammar annotation has approximately 8 atomic features.
>
> Words are treated as initially ambiguous, so that for the word "book" occurrence in the text we get grammar tokens:
> V    pres    N    sg
> 2 parses: "V,pres" and "N,sg" are just independent tokens with positionIncrement=0 in the index.
>
> Moreover, each such token has parse bitmask in its payload:
> V|0001    pres|0001    N|0010    sg|0010
>
> Here, V and pres appeared in the 1st parse; N and sg in the 2nd with the maximum of 4 parse variants. It allows me to find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".
>
> So, I'm performing a SpanNearQuery {"A,sg" immediately before "N,sg"} with position and payload checking over a 600MB segment, and getting the precise doc hit count and overall match count by iterating over getSpans().
>
> This takes me about 20 seconds, even if everything is in RAM.
> The next thing I'm going to explore is compression, I'll try DirectPostingsFormat as you suggested.
>
> --
> Best Regards,
> Igor
>
> 17.10.2013, 20:26, "Michael McCandless" <lu...@mikemccandless.com>:
>
>>  DirectPostingsFormat holds all postings in RAM, uncompressed, as
>>  simple java arrays.  But it's quite RAM heavy...
>>
>>  The hotspots may also be in the queries you are running ... maybe you
>>  can describe more how you're using Lucene?
>>
>>  Mike McCandless
>>
>>  http://blog.mikemccandless.com
>>
>>  On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>>  <is...@yandex-team.ru> wrote:
>>>   Hello!
>>>
>>>   I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ).
>>>   Thus, I think my problem is not disk access (although I always see getPayload() at the top in VisualVM).
>>>   So, maybe the hard part of postings traversal is decompression?
>>>   Are there Lucene codecs that use light postings compression (maybe none at all)?
>>>
>>>   And, getting back to the in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>>
>>>   --
>>>   Best Regards,
>>>   Igor
>>>
>>>   10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
>>>>   I don't think you want to load indexes of this size into a RAMDirectory.
>>>>   The reasons have been listed multiple times here... in short, just use
>>>>   MMapDirectory.
>>>>
>>>>   On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>   <is...@yandex-team.ru>wrote:
>>>>>    Hello!
>>>>>
>>>>>    I need to run an experiment: load the entire index into RAM and
>>>>>    see how the search performance changes.
>>>>>    My index has TermVectors with payload and position info, StoredFields, and
>>>>>    DocValues. It takes ~30GB on disk (the server has 48GB of RAM).
>>>>>
>>>>>    _indexDirectoryReader = DirectoryReader.open(new RAMDirectory(FSDirectory.open(new
>>>>>    File(_indexDirectory)), IOContext.READONCE));
>>>>>
>>>>>    Is the line above all I have to do to achieve my goal?
>>>>>
>>>>>    And also:
>>>>>    - will all the data be loaded into RAM right after opening, or lazily
>>>>>    during the reading stage?
>>>>>    - will the index data be stored in RAM in its on-disk format, or
>>>>>    decompressed first?
>>>>>
>>>>>    --
>>>>>    Best Regards,
>>>>>    Igor
>>>>>
>



Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Oct 25, 2013 at 9:58 AM, Igor Shalyminov
<is...@yandex-team.ru> wrote:

> What is ProxBooleanTermQuery?
> I couldn't find it in the trunk and in that ticket's (https://issues.apache.org/jira/browse/LUCENE-2878) patch.

Sorry, this is on https://issues.apache.org/jira/browse/LUCENE-5288

Next time try searching for ProxBooleanTermQuery at
http://jirasearch.mikemccandless.com -- only one match :)

As that patch now stands, it only uses prox to boost scoring ... so
you'd need to heavily modify it to do matching as well.

> And for now it's very fuzzy to me how searching/scoring works. Are there any tutorials or talks on how Queries, Scorers, and Collectors interoperate?

Here's an overview: https://lucene.apache.org/core/3_6_2/scoring.html

It's a bit old (3.6.x) but much of it is still relevant.

Mike McCandless

http://blog.mikemccandless.com



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
What is ProxBooleanTermQuery?
I couldn't find it in the trunk and in that ticket's (https://issues.apache.org/jira/browse/LUCENE-2878) patch.
And for now it's very fuzzy to me how searching/scoring works. Are there any tutorials or talks on how Queries, Scorers, and Collectors interoperate?


-- 
Igor

23.10.2013, 19:06, "Michael McCandless" <lu...@mikemccandless.com>:
> On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  Thanks for the link, I'll definitely dig into SpanQuery internals very soon.
>
> You could also just make a custom query.  If you start from the
> ProxBooleanTermQuery on that issue, but change it so that it rejects
> hits that didn't have terms in the right positions, then you'll likely
> have a much faster way to do your query.
>
>>>>   For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>>  I didn't even realize you could pass negative slop to span queries.
>>>  What does that do?  Or did you mean slop=1?
>>  I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on some forum, maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
>
> Wow, OK.  I have no idea what slop=-1 does...
>
>>  So far it works for me:)
>>>>   I wrap them into an ordered SpanNearQuery with the slop=0.
>>>>
>>>>   I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
>>>>
>>>>   If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?
>>>  I think it would help to avoid payloads, but I'm not sure by how much.
>>>   E.g., I see that NearSpansOrdered creates a new Set for every hit
>>>  just to hold payloads, even if payloads are not going to be used.
>>>  Really the span scorers should check Terms.hasPayloads up front ...
>>>>   My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.
>>>  OK.
>>>
>>>  I wonder if you can somehow identify the spans you care about at
>>>  indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>>>  index at that point; this would make searching much faster (it becomes
>>>  a TermQuery).  For exact matching (slop=0) you can also index
>>>  shingles.
>>  Thanks for the clue, I think it can be a good optimization heuristic.
>>  I actually tried a similar approach to optimize search of attributes at the same position.
>>  Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>>
>>  * the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". With payloads and positions assigned the right way, this would allow us to search for an arbitrary combination of these attributes _but_ with multiple postings merging.
>>  * the experimental approach: sort the atomics lexicographically and index all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the preprocessing of the user query the same way (split - sort - join) it would allow us to process the same queries exactly within one posting.
>>
>>  This technique is actually used in our current production index, which is based on the Yandex.Server engine.
>>  But Yandex.Server somehow keeps the index size reasonable (within an order of magnitude of the original text size), while the Lucene index blows up (to >10 times the original text size) and no search performance improvement appears.
>
> That's really odd.  I would expect index to become much larger, but
> search performance ought to be much faster since you run simple
> TermQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>



Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
<is...@yandex-team.ru> wrote:

> Thanks for the link, I'll definitely dig into SpanQuery internals very soon.

You could also just make a custom query.  If you start from the
ProxBooleanTermQuery on that issue, but change it so that it rejects
hits that didn't have terms in the right positions, then you'll likely
have a much faster way to do your query.

>>>  For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>
>> I didn't even realize you could pass negative slop to span queries.
>> What does that do?  Or did you mean slop=1?
>
> I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on some forum, maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)

Wow, OK.  I have no idea what slop=-1 does...

> So far it works for me:)
>
>>
>>>  I wrap them into an ordered SpanNearQuery with the slop=0.
>>>
>>>  I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
>>>
>>>  If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?
>>
>> I think it would help to avoid payloads, but I'm not sure by how much.
>>  E.g., I see that NearSpansOrdered creates a new Set for every hit
>> just to hold payloads, even if payloads are not going to be used.
>> Really the span scorers should check Terms.hasPayloads up front ...
>>
>>>  My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.
>>
>> OK.
>>
>> I wonder if you can somehow identify the spans you care about at
>> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>> index at that point; this would make searching much faster (it becomes
>> a TermQuery).  For exact matching (slop=0) you can also index
>> shingles.
>
> Thanks for the clue, I think it can be a good optimization heuristic.
> I actually tried a similar approach to optimize search of attributes at the same position.
> Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>
> * the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". With payloads and positions assigned the right way, this would allow us to search for an arbitrary combination of these attributes _but_ with multiple postings merging.
> * the experimental approach: sort the atomics lexicographically and index all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the preprocessing of the user query the same way (split - sort - join) it would allow us to process the same queries exactly within one posting.
>
> This technique is actually used in our current production index, which is based on the Yandex.Server engine.
> But Yandex.Server somehow keeps the index size reasonable (within an order of magnitude of the original text size), while the Lucene index blows up (to >10 times the original text size) and no search performance improvement appears.

That's really odd.  I would expect index to become much larger, but
search performance ought to be much faster since you run simple
TermQuery.

Mike McCandless

http://blog.mikemccandless.com



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Hello Mike!


19.10.2013, 14:54, "Michael McCandless" <lu...@mikemccandless.com>:
> On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  But why is it so costly?
>
> I think because the matching is inherently complex?  But also because
> it does high-cost things like allocating new List and Set for every
> matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold
> all payloads it encountered within each span. Patches welcome!
>
>>  In a regular query we walk postings and match document numbers, in a SpanQuery we match position numbers (or position segments), what's the principal difference?
>>  I think it's just that #documents << #positions.
>
> Conceptually, that's right: we just need to decode "more ints" (and
> also the payloads), but then essentially merge-sorting the positions
> of N terms and "coalescing" them into spans is at heart rather
> costly.  Lots of hard-for-CPU-to-predict branches...
>
> But I suspect we could get some good speedups on span queries with a
> better implementation;
> https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly]
> exploring making positions "first class" in Scorer, so you can iterate
> over position + payload for each hit.

Thanks for the link, I'll definitely dig into SpanQuery internals very soon.

>
>>  For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>
> I didn't even realize you could pass negative slop to span queries.
> What does that do?  Or did you mean slop=1?

I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on some forum, maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
So far it works for me:)

>
>>  I wrap them into an ordered SpanNearQuery with the slop=0.
>>
>>  I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
>>
>>  If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?
>
> I think it would help to avoid payloads, but I'm not sure by how much.
>  E.g., I see that NearSpansOrdered creates a new Set for every hit
> just to hold payloads, even if payloads are not going to be used.
> Really the span scorers should check Terms.hasPayloads up front ...
>
>>  My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.
>
> OK.
>
> I wonder if you can somehow identify the spans you care about at
> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
> index at that point; this would make searching much faster (it becomes
> a TermQuery).  For exact matching (slop=0) you can also index
> shingles.

Thanks for the clue, I think it can be a good optimization heuristic.
I actually tried a similar approach to optimize search of attributes at the same position.
Here's how it was supposed to work for a feature set "S,sg,nom,fem":

* the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". With payloads and positions assigned the right way, this would allow us to search for an arbitrary combination of these attributes _but_ with multiple postings merging.
* the experimental approach: sort the atomics lexicographically and index all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the preprocessing of the user query the same way (split - sort - join) it would allow us to process the same queries exactly within one posting.

This technique is actually used in our current production index, which is based on the Yandex.Server engine.
But Yandex.Server somehow keeps the index size reasonable (within an order of magnitude of the original text size), while the Lucene index blows up (to >10 times the original text size) and no search performance improvement appears.
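The split - sort - join preprocessing behind the experimental approach above can be sketched in plain Java (class and method names here are illustrative, not from the actual indexer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class GrammarSubsets {
    // Expand one annotation like "S,sg,nom,fem" into every non-empty,
    // lexicographically sorted subset of its atomic features, joined by
    // commas. A user query preprocessed the same way (split - sort - join)
    // then matches exactly one indexed term.
    public static List<String> subsets(String annotation) {
        List<String> atomics =
                new ArrayList<>(new TreeSet<>(Arrays.asList(annotation.split(","))));
        List<String> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << atomics.size()); mask++) {
            StringBuilder term = new StringBuilder();
            for (int i = 0; i < atomics.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    if (term.length() > 0) term.append(',');
                    term.append(atomics.get(i));
                }
            }
            result.add(term.toString());
        }
        return result;
    }
}
```

For n atomic features this emits 2^n - 1 terms per word, which is consistent with the index blow-up described above.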

>



Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov
<is...@yandex-team.ru> wrote:
> But why is it so costly?

I think because the matching is inherently complex?  But also because
it does high-cost things like allocating new List and Set for every
matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold
all payloads it encountered within each span. Patches welcome!

> In a regular query we walk postings and match document numbers, in a SpanQuery we match position numbers (or position segments), what's the principal difference?
> I think it's just that #documents << #positions.

Conceptually, that's right: we just need to decode "more ints" (and
also the payloads), but then essentially merge-sorting the positions
of N terms and "coalescing" them into spans is at heart rather
costly.  Lots of hard-for-CPU-to-predict branches...

But I suspect we could get some good speedups on span queries with a
better implementation;
https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly]
exploring making positions "first class" in Scorer, so you can iterate
over position + payload for each hit.

> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.

I didn't even realize you could pass negative slop to span queries.
What does that do?  Or did you mean slop=1?

> I wrap them into an ordered SpanNearQuery with the slop=0.
>
> I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
>
> If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?

I think it would help to avoid payloads, but I'm not sure by how much.
 E.g., I see that NearSpansOrdered creates a new Set for every hit
just to hold payloads, even if payloads are not going to be used.
Really the span scorers should check Terms.hasPayloads up front ...

> My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.

OK.

I wonder if you can somehow identify the spans you care about at
indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
index at that point; this would make searching much faster (it becomes
a TermQuery).  For exact matching (slop=0) you can also index
shingles.



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
But why is it so costly?

In a regular query we walk postings and match document numbers, in a SpanQuery we match position numbers (or position segments), what's the principal difference?
I think it's just that #documents << #positions.

For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1. I wrap them into an ordered SpanNearQuery with the slop=0.
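Spelled out against the Lucene 4.x span API, that construction looks roughly like this (the field and term names are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class GrammarSpanQuery {
    public static SpanQuery adjSgBeforeNounSg() {
        // "A" and "sg" at the same position: unordered near with slop -1
        // (the trick mentioned above) collapses the clauses onto one position.
        SpanQuery adjSg = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("grammar", "A")),
                new SpanTermQuery(new Term("grammar", "sg"))}, -1, false);
        SpanQuery nounSg = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("grammar", "N")),
                new SpanTermQuery(new Term("grammar", "sg"))}, -1, false);
        // "A,sg" immediately before "N,sg": ordered near with slop 0.
        return new SpanNearQuery(new SpanQuery[] {adjSg, nounSg}, 0, true);
    }
}
```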

I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?
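For context, the payload check in question is the parse-bitmask intersection described earlier in the thread; stripped of the Lucene plumbing it is a one-line test (a sketch with made-up names):

```java
public class ParseBitmaskCheck {
    // Each token at a position carries a bitmask of the parses it belongs to
    // (e.g. V|0001, pres|0001, N|0010, sg|0010). Two grammar tokens can come
    // from the same parse of a word only if their masks intersect.
    public static boolean sameParse(int maskA, int maskB) {
        return (maskA & maskB) != 0;
    }

    public static void main(String[] args) {
        int v = 0b0001, pres = 0b0001, sg = 0b0010;
        System.out.println(sameParse(v, pres)); // "V" & "pres" can match "book"
        System.out.println(sameParse(v, sg));   // "V" & "sg" cannot
    }
}
```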

My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.


-- 
Best Regards,
Igor

18.10.2013, 23:37, "Michael McCandless" <lu...@mikemccandless.com>:
> Unfortunately, SpanNearQuery is a very costly query.  What slop are you passing?
>
> You might want to check out
> https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds
> proximity boosting to queries, but it's still very early in its
> iteration, and if you need a precise count of only those documents
> matching the SpanNearQuery, then that issue won't help.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  Mike,
>>
>>  For now I'm using just a SpanQuery over a ~600MB index segment single-threadedly (one segment - one thread, the complete setup is 30 segments with the total of 20GB).
>>
>>  I'm trying to use Lucene for the morphologically annotated text corpus (namely, Russian National Corpus).
>>  The main query type in it is co-occurrence search with desired word morphological features and distance between tokens.
>>
>>  In my test case I work with a single field - grammar (it is word-level - every word in the corpus has one). Full grammar annotation of a word is a set of atomic grammar features.
>>  For example, the verb "book" has in its grammar:
>>  - POS  tag (V);
>>  - time (pres);
>>
>>  and the noun "book":
>>  - POS tag (N)
>>  - number (sg).
>>
>>  In general one grammar annotation has approximately 8 atomic features.
>>
>>  Words are treated as initially ambiguous, so that for the word "book" occurrence in the text we get grammar tokens:
>>  V    pres    N    sg
>>  2 parses: "V,pres" and "N,sg" are just independent tokens with positionIncrement=0 in the index.
>>
>>  Moreover, each such token has parse bitmask in its payload:
>>  V|0001    pres|0001    N|0010    sg|0010
>>
>>  Here, V and pres appeared in the 1st parse; N and sg in the 2nd with the maximum of 4 parse variants. It allows me to find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".
>>
>>  So, I'm performing a SpanNearQuery {"A,sg" immediately before "N,sg"} with position and payload checking over a 600MB segment, and getting the precise doc hit count and overall match count by iterating over getSpans().
>>
>>  This takes me about 20 seconds, even if everything is in RAM.
>>  The next thing I'm going to explore is compression, I'll try DirectPostingsFormat as you suggested.
>>
>>  --
>>  Best Regards,
>>  Igor
>>
>>  17.10.2013, 20:26, "Michael McCandless" <lu...@mikemccandless.com>:
>>>  DirectPostingsFormat holds all postings in RAM, uncompressed, as
>>>  simple java arrays.  But it's quite RAM heavy...
>>>
>>>  The hotspots may also be in the queries you are running ... maybe you
>>>  can describe more how you're using Lucene?
>>>
>>>  Mike McCandless
>>>
>>>  http://blog.mikemccandless.com
>>>
>>>  On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>>>  <is...@yandex-team.ru> wrote:
>>>>   Hello!
>>>>
>>>>   I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ).
>>>>   Thus, I think my problem is not disk access (although I always see getPayload() at the top in VisualVM).
>>>>   So, maybe the hard part of postings traversal is decompression?
>>>>   Are there Lucene codecs that use light postings compression (maybe none at all)?
>>>>
>>>>   And, getting back to the in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>>>
>>>>   --
>>>>   Best Regards,
>>>>   Igor
>>>>
>>>>   10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
>>>>>   I don't think you want to load indexes of this size into a RAMDirectory.
>>>>>   The reasons have been listed multiple times here... in short, just use
>>>>>   MMapDirectory.
>>>>>
>>>>>   On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>>   <is...@yandex-team.ru>wrote:
>>>>>>    Hello!
>>>>>>
>>>>>>    I need to run an experiment: load the entire index into RAM and
>>>>>>    see how the search performance changes.
>>>>>>    My index has TermVectors with payload and position info, StoredFields, and
>>>>>>    DocValues. It takes ~30GB on disk (the server has 48GB of RAM).
>>>>>>
>>>>>>    _indexDirectoryReader = DirectoryReader.open(new RAMDirectory(FSDirectory.open(new
>>>>>>    File(_indexDirectory)), IOContext.READONCE));
>>>>>>
>>>>>>    Is the line above all I have to do to achieve my goal?
>>>>>>
>>>>>>    And also:
>>>>>>    - will all the data be loaded into RAM right after opening, or lazily
>>>>>>    during the reading stage?
>>>>>>    - will the index data be stored in RAM in its on-disk format, or
>>>>>>    decompressed first?
>>>>>>
>>>>>>    --
>>>>>>    Best Regards,
>>>>>>    Igor
>>>>>>
>



Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately, SpanNearQuery is a very costly query.  What slop are you passing?

You might want to check out
https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds
proximity boosting to queries, but it's still very early in its
iteration, and if you need a precise count of only those documents
matching the SpanNearQuery, then that issue won't help.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov
<is...@yandex-team.ru> wrote:
> Mike,
>
> For now I'm using just a SpanQuery over a ~600MB index segment single-threadedly (one segment - one thread, the complete setup is 30 segments with the total of 20GB).
>
> I'm trying to use Lucene for the morphologically annotated text corpus (namely, Russian National Corpus).
> The main query type in it is co-occurrence search with desired word morphological features and distance between tokens.
>
> In my test case I work with a single field - grammar (it is word-level - every word in the corpus has one). Full grammar annotation of a word is a set of atomic grammar features.
> For example, the verb "book" has in its grammar:
> - POS  tag (V);
> - time (pres);
>
> and the noun "book":
> - POS tag (N)
> - number (sg).
>
> In general one grammar annotation has approximately 8 atomic features.
>
> Words are treated as initially ambiguous, so that for the word "book" occurrence in the text we get grammar tokens:
> V    pres    N    sg
> 2 parses: "V,pres" and "N,sg" are just independent tokens with positionIncrement=0 in the index.
>
> Moreover, each such token has parse bitmask in its payload:
> V|0001    pres|0001    N|0010    sg|0010
>
> Here, V and pres appeared in the 1st parse; N and sg in the 2nd with the maximum of 4 parse variants. It allows me to find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".
>
> So, I'm performing a SpanNearQuery "{"A,sg" that goes right before "N,sg"} with position and payload checking over a 600MB segment and getting the precise doc hits number and overall matches number via iterating over getSpans().
>
> This takes me about 20 seconds, even if everything is in RAM.
> The next thing I'm going to explore is compression, I'll try DirectPostingsFormat as you suggested.
>
> --
> Best Regards,
> Igor
>
> 17.10.2013, 20:26, "Michael McCandless" <lu...@mikemccandless.com>:
>> DirectPostingsFormat holds all postings in RAM, uncompressed, as
>> simple java arrays.  But it's quite RAM heavy...
>>
>> The hotspots may also be in the queries you are running ... maybe you
>> can describe more how you're using Lucene?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>> <is...@yandex-team.ru> wrote:
>>
>>>  Hello!
>>>
>>>  I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both work the same for me (the same bad:( ).
>>>  Thus, I think my problem is not disk access (although I always see getPayload() in the VisualVM top).
>>>  So, maybe the hard part in the postings traversal is decompression?
>>>  Are there Lucene codecs which use light postings compression (maybe none at all)?
>>>
>>>  And, getting back to in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>>
>>>  --
>>>  Best Regards,
>>>  Igor
>>>
>>>  10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
>>>>  I don't think you want to load indexes of this size into a RAMDirectory.
>>>>  The reasons have been listed multiple times here... in short, just use
>>>>  MMapDirectory.
>>>>
>>>>  On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>  <is...@yandex-team.ru>wrote:
>>>>>   Hello!
>>>>>
>>>>>   I need to perform an experiment of loading the entire index in RAM and
>>>>>   seeing how the search performance changes.
>>>>>   My index has TermVectors with payload and position info, StoredFields, and
>>>>>   DocValues. It takes ~30GB on disk (the server has 48).
>>>>>
>>>>>   _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>>>>   File(_indexDirectory)));
>>>>>
>>>>>   Is the line above the only thing I have to do to complete my goal?
>>>>>
>>>>>   And also:
>>>>>   - will all the data be loaded in the RAM right after opening, or during
>>>>>   the reading stage?
>>>>>   - will the index data be stored in RAM as it is on disk, or will it be
>>>>>   uncompressed first?
>>>>>
>>>>>   --
>>>>>   Best Regards,
>>>>>   Igor
>>>>>



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Mike,

For now I'm using just a SpanQuery over a ~600MB index segment, single-threaded (one segment per thread; the complete setup is 30 segments totalling 20GB).

I'm trying to use Lucene for the morphologically annotated text corpus (namely, Russian National Corpus).
The main query type in it is co-occurrence search with desired word morphological features and distance between tokens.

In my test case I work with a single field, grammar (it is word-level: every word in the corpus has one). The full grammar annotation of a word is a set of atomic grammar features.
For example, the verb "book" has in its grammar:
- POS tag (V);
- time (pres);

and the noun "book":
- POS tag (N);
- number (sg).

In general, one grammar annotation has approximately 8 atomic features.

Words are treated as initially ambiguous, so for an occurrence of the word "book" in the text we get the grammar tokens:
V    pres    N    sg
There are 2 parses here: "V,pres" and "N,sg" are simply indexed as independent tokens with positionIncrement=0.

Moreover, each such token has a parse bitmask in its payload:
V|0001    pres|0001    N|0010    sg|0010

Here, V and pres appeared in the 1st parse and N and sg in the 2nd, out of a maximum of 4 parse variants. This lets me find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".

So, I'm running a SpanNearQuery ("A,sg" immediately followed by "N,sg") with position and payload checking over a 600MB segment, and getting the precise doc hit count and the overall match count by iterating over getSpans().

This takes me about 20 seconds, even if everything is in RAM.
The next thing I'm going to explore is compression, I'll try DirectPostingsFormat as you suggested.
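The counting loop looks roughly like this (Lucene 4.x Spans API; variable names are illustrative, error handling omitted):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.spans.Spans;

// Walk every match of the span query, counting both the number of
// distinct documents hit and the total number of matches.
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
int docHits = 0, totalMatches = 0;
for (AtomicReaderContext leaf : reader.leaves()) {
    Spans spans = spanNearQuery.getSpans(leaf, leaf.reader().getLiveDocs(), termContexts);
    int lastDoc = -1;
    while (spans.next()) {
        if (spans.doc() != lastDoc) { docHits++; lastDoc = spans.doc(); }
        totalMatches++;  // the payload check on spans.getPayload() goes here
    }
}
```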

--
Best Regards,
Igor

17.10.2013, 20:26, "Michael McCandless" <lu...@mikemccandless.com>:
> DirectPostingsFormat holds all postings in RAM, uncompressed, as
> simple java arrays.  But it's quite RAM heavy...
>
> The hotspots may also be in the queries you are running ... maybe you
> can describe more how you're using Lucene?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
> <is...@yandex-team.ru> wrote:
>
>>  Hello!
>>
>>  I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both work the same for me (the same bad:( ).
>>  Thus, I think my problem is not disk access (although I always see getPayload() in the VisualVM top).
>>  So, maybe the hard part in the postings traversal is decompression?
>>  Are there Lucene codecs which use light postings compression (maybe none at all)?
>>
>>  And, getting back to in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>
>>  --
>>  Best Regards,
>>  Igor
>>
>>  10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
>>>  I don't think you want to load indexes of this size into a RAMDirectory.
>>>  The reasons have been listed multiple times here... in short, just use
>>>  MMapDirectory.
>>>
>>>  On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>  <is...@yandex-team.ru>wrote:
>>>>   Hello!
>>>>
>>>>   I need to perform an experiment of loading the entire index in RAM and
>>>>   seeing how the search performance changes.
>>>>   My index has TermVectors with payload and position info, StoredFields, and
>>>>   DocValues. It takes ~30GB on disk (the server has 48).
>>>>
>>>>   _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>>>   File(_indexDirectory)));
>>>>
>>>>   Is the line above the only thing I have to do to complete my goal?
>>>>
>>>>   And also:
>>>>   - will all the data be loaded in the RAM right after opening, or during
>>>>   the reading stage?
>>>>   - will the index data be stored in RAM as it is on disk, or will it be
>>>>   uncompressed first?
>>>>
>>>>   --
>>>>   Best Regards,
>>>>   Igor
>>>>



Re: Lucene in-memory index

Posted by Michael McCandless <lu...@mikemccandless.com>.
DirectPostingsFormat holds all postings in RAM, uncompressed, as
simple java arrays.  But it's quite RAM heavy...
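You wire it in through the codec at index time, e.g. (Lucene 4.x sketch; DirectPostingsFormat lives in the lucene-codecs module, and the analyzer here is a placeholder):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Use DirectPostingsFormat for every field by overriding the
// per-field postings format of the default codec.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
        new StandardAnalyzer(Version.LUCENE_44));
iwc.setCodec(new Lucene42Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return new DirectPostingsFormat();
    }
});
```

Note the codec is recorded per segment, so the index must be rewritten (reindexed or force-merged) with this config before searches see it.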

The hotspots may also be in the queries you are running ... maybe you
can describe more how you're using Lucene?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
<is...@yandex-team.ru> wrote:
> Hello!
>
> I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both work the same for me (the same bad:( ).
> Thus, I think my problem is not disk access (although I always see getPayload() in the VisualVM top).
> So, maybe the hard part in the postings traversal is decompression?
> Are there Lucene codecs which use light postings compression (maybe none at all)?
>
> And, getting back to in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>
> --
> Best Regards,
> Igor
>
> 10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
>> I don't think you want to load indexes of this size into a RAMDirectory.
>> The reasons have been listed multiple times here... in short, just use
>> MMapDirectory.
>>
>> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>> <is...@yandex-team.ru>wrote:
>>
>>>  Hello!
>>>
>>>  I need to perform an experiment of loading the entire index in RAM and
>>>  seeing how the search performance changes.
>>>  My index has TermVectors with payload and position info, StoredFields, and
>>>  DocValues. It takes ~30GB on disk (the server has 48).
>>>
>>>  _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>>  File(_indexDirectory)));
>>>
>>>  Is the line above the only thing I have to do to complete my goal?
>>>
>>>  And also:
>>>  - will all the data be loaded in the RAM right after opening, or during
>>>  the reading stage?
>>>  - will the index data be stored in RAM as it is on disk, or will it be
>>>  uncompressed first?
>>>
>>>  --
>>>  Best Regards,
>>>  Igor
>>>



Re: Lucene in-memory index

Posted by Igor Shalyminov <is...@yandex-team.ru>.
Hello!

I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ).
Thus, I think my problem is not disk access (although I always see getPayload() at the top of the VisualVM profile).
So, maybe the hard part in the postings traversal is decompression?
Are there Lucene codecs which use light postings compression (maybe none at all)?

And, getting back to in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?

-- 
Best Regards,
Igor

10.10.2013, 03:01, "Vitaly Funstein" <vf...@gmail.com>:
> I don't think you want to load indexes of this size into a RAMDirectory.
> The reasons have been listed multiple times here... in short, just use
> MMapDirectory.
>
> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
> <is...@yandex-team.ru>wrote:
>
>>  Hello!
>>
>>  I need to perform an experiment of loading the entire index in RAM and
>>  seeing how the search performance changes.
>>  My index has TermVectors with payload and position info, StoredFields, and
>>  DocValues. It takes ~30GB on disk (the server has 48).
>>
>>  _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>  File(_indexDirectory)));
>>
>>  Is the line above the only thing I have to do to complete my goal?
>>
>>  And also:
>>  - will all the data be loaded in the RAM right after opening, or during
>>  the reading stage?
>>  - will the index data be stored in RAM as it is on disk, or will it be
>>  uncompressed first?
>>
>>  --
>>  Best Regards,
>>  Igor
>>


Re: Lucene in-memory index

Posted by Vitaly Funstein <vf...@gmail.com>.
I don't think you want to load indexes of this size into a RAMDirectory.
The reasons have been listed multiple times here... in short, just use
MMapDirectory.
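i.e. something like this (Lucene 4.x; the path is illustrative):

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

// Let the OS page cache keep the hot parts of the index in RAM; with
// 48GB of RAM and a 30GB index, the whole index ends up cached anyway.
Directory dir = new MMapDirectory(new File("/path/to/index"));
DirectoryReader reader = DirectoryReader.open(dir);
```

(As an aside, RAMDirectory has no static open(File) method; copying an on-disk index into the heap would be `new RAMDirectory(FSDirectory.open(indexDir), IOContext.READ)`, but again, you don't want that at this size.)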


On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
<is...@yandex-team.ru>wrote:

> Hello!
>
> I need to perform an experiment of loading the entire index in RAM and
> seeing how the search performance changes.
> My index has TermVectors with payload and position info, StoredFields, and
> DocValues. It takes ~30GB on disk (the server has 48).
>
> _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
> File(_indexDirectory)));
>
> Is the line above the only thing I have to do to complete my goal?
>
> And also:
> - will all the data be loaded in the RAM right after opening, or during
> the reading stage?
> - will the index data be stored in RAM as it is on disk, or will it be
> uncompressed first?
>
> --
> Best Regards,
> Igor
>

Re: Wildcard question

Posted by Jack Krupansky <ja...@basetechnology.com>.
You get to decide:

class QueryParser extends QueryParserBase:

/**
* Set to <code>true</code> to allow leading wildcard characters.
* <p>
* When set, <code>*</code> or <code>?</code> are allowed as
* the first character of a PrefixQuery and WildcardQuery.
* Note that this can produce very slow
* queries on big indexes.
* <p>
* Default: false.
*/
@Override
public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {
  this.allowLeadingWildcard = allowLeadingWildcard;
}

And the default is "false" (leading wildcards are not allowed).
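Usage is just (classic query parser, Lucene 4.x; field name and analyzer are placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// A leading wildcard is rejected with a ParseException unless enabled.
QueryParser parser = new QueryParser(Version.LUCENE_44, "body",
        new StandardAnalyzer(Version.LUCENE_44));
parser.setAllowLeadingWildcard(true);
Query q = parser.parse("*tion");  // now parses instead of throwing
```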

-- Jack Krupansky

-----Original Message----- 
From: Carlos de Luna Saenz
Sent: Wednesday, October 09, 2013 6:32 PM
To: java-user@lucene.apache.org
Subject: Wildcard question

I've used Lucene 2, 3 and now 4... I used to believe that a * wildcard at the
beginning had been accepted since 3 (though I never used it), but the
documentation says "Note: You cannot use a * or ? symbol as the first
character of a search." Is that correct, or is it an outdated note in the
http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
documentation?
Thanks in advance.




Wildcard question

Posted by Carlos de Luna Saenz <cd...@yahoo.com.mx>.
I've used Lucene 2, 3 and now 4... I used to believe that a * wildcard at the beginning had been accepted since 3 (though I never used it), but the documentation says "Note: You cannot use a * or ? symbol as the first character of a search." Is that correct, or is it an outdated note in the http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description documentation?
Thanks in advance.