Posted to java-user@lucene.apache.org by "Cabansag, Ronald-Alvin R" <ro...@cengage.com> on 2010/10/29 15:27:06 UTC

lucene norms cached twice

We are working with a large read-only Lucene index (a single segment) with a large number of fields and documents, and we are running into memory usage problems.

We found that when using a ReadOnlyDirectoryReader and an IndexSearcher created from that same reader, the norms are cached twice: first by the reader itself and then by the reader's subreaders. Is there an easy way to avoid having the norms cached twice when we only have a single subreader?

We thought of the following options:
1.) Pass in the main reader as a subreader when creating the IndexSearcher (e.g. new IndexSearcher(mainReader, new IndexReader[] {mainReader}, new int[] {0})).
2.) Override the ReadOnlyDirectoryReader.getSequentialSubReaders() method to return null. This tells the IndexSearcher to use the main reader (the ReadOnlyDirectoryReader) directly.
3.) Use SegmentReader.get(boolean, SegmentInfo, int) to create a ReadOnlySegmentReader and use that as our main reader instead.

Are there any negative implications to the above approaches? Or are there better approaches to the problem?

Thanks in advance for any help.

Alvin Cab
Cengage Learning


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Grammatical terms

Posted by Shashi Kant <sk...@sloan.mit.edu>.

For part-of-speech (POS) identification you are better off looking at a tool
like OpenNLP or NLTK.


2010/10/30 Mário André <ma...@infonet.com.br>

>
> Hi,
> I need a Java API that identifies the grammatical terms in a noun phrase (NP).
> E.g.: I see the words when you are talking.
> See: Verb
> Words: Noun
> are: Verb
> talk: Verb
> Can I use Lucene for that?
>
> Thanks!
>
> -------------------------------
> Mário André
> Master's degree in MCC
> Federal University of Alagoas
> Federal Institute of Sergipe, Professor
> Skype: mario-fa
> www.marioandre.com.br
> www.neurominer.com
> -------------------------------

Grammatical terms

Posted by Mário André <ma...@infonet.com.br>.

Hi,
I need a Java API that identifies the grammatical terms in a noun phrase (NP).
E.g.: I see the words when you are talking.
See: Verb
Words: Noun
are: Verb
talk: Verb
Can I use Lucene for that?

Thanks!

-------------------------------
Mário André
Master's degree in MCC
Federal University of Alagoas
Federal Institute of Sergipe, Professor
Skype: mario-fa
www.marioandre.com.br
www.neurominer.com
-------------------------------

RE: lucene norms cached twice

Posted by "Cabansag, Ronald-Alvin R" <ro...@cengage.com>.

Yonik,

Thanks for the input. We'll try this out.
And you're right - I simplified the description of our first operation.

-Al
Cengage Learning

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Friday, October 29, 2010 3:40 PM
To: java-user@lucene.apache.org
Subject: Re: lucene norms cached twice

On Fri, Oct 29, 2010 at 3:32 PM, Cabansag, Ronald-Alvin R
<ro...@cengage.com> wrote:
> We use a QueryWrapperFilter.getDocIdSet(indexReader) to get the DocIdSet and compute the hit count using its iterator.

If you want to avoid double-caching of norms, then you should call
getDocIdSet() for each segment reader, not the top level reader.

Aside: presumably you're actually doing something more advanced than
getting the hit count (and you just simplified your description
because it wasn't pertinent)... since you can get the hit count from
TopDocs.

-Yonik
http://www.lucidimagination.com

Re: lucene norms cached twice

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Oct 29, 2010 at 3:32 PM, Cabansag, Ronald-Alvin R
<ro...@cengage.com> wrote:
> We use a QueryWrapperFilter.getDocIdSet(indexReader) to get the DocIdSet and compute the hit count using its iterator.

If you want to avoid double-caching of norms, then you should call
getDocIdSet() for each segment reader, not the top level reader.

Aside: presumably you're actually doing something more advanced than
getting the hit count (and you just simplified your description
because it wasn't pertinent)... since you can get the hit count from
TopDocs.

-Yonik
http://www.lucidimagination.com
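
The suggestion above amounts to walking down to the segment readers and evaluating the filter on each of them, never on the top-level reader. What follows is a toy model in plain Java, not Lucene code: ToyReader, leaf(), composite() and filterCalls are hypothetical names invented for this illustration. In Lucene 2.9/3.x the corresponding calls would presumably be IndexReader.getSequentialSubReaders() (which returns null for a segment reader), Filter.getDocIdSet(IndexReader) and DocIdSetIterator.nextDoc().

```java
import java.util.ArrayList;
import java.util.List;

public class PerSegmentCountSketch {

    /** Toy stand-in for a reader. This is NOT Lucene's IndexReader. */
    static final class ToyReader {
        final boolean[] matches;      // leaf only: matches[doc] is true if doc passes the filter
        final ToyReader[] subReaders; // composite only; null for a leaf (a single segment)
        int filterCalls = 0;          // how many times the filter was evaluated on this reader

        private ToyReader(boolean[] matches, ToyReader[] subReaders) {
            this.matches = matches;
            this.subReaders = subReaders;
        }

        static ToyReader leaf(boolean... matches) {
            return new ToyReader(matches, null);
        }

        static ToyReader composite(ToyReader... subReaders) {
            return new ToyReader(null, subReaders);
        }
    }

    /** Collects the leaf readers, recursing through nested composite readers. */
    static void gatherLeaves(ToyReader reader, List<ToyReader> out) {
        if (reader.subReaders == null) {
            out.add(reader);
            return;
        }
        for (ToyReader sub : reader.subReaders) {
            gatherLeaves(sub, out);
        }
    }

    /** Evaluates the (toy) filter on one leaf and counts its matching documents. */
    static int countHits(ToyReader leaf) {
        leaf.filterCalls++;
        int hits = 0;
        for (boolean match : leaf.matches) {
            if (match) {
                hits++;
            }
        }
        return hits;
    }

    /** Hit count for the whole index, computed one segment at a time. */
    static int countHitsPerSegment(ToyReader top) {
        List<ToyReader> leaves = new ArrayList<ToyReader>();
        gatherLeaves(top, leaves);
        int total = 0;
        for (ToyReader leaf : leaves) {
            total += countHits(leaf);
        }
        return total;
    }

    public static void main(String[] args) {
        ToyReader top = ToyReader.composite(
                ToyReader.leaf(true, false, true),
                ToyReader.composite(ToyReader.leaf(false, true)));
        System.out.println("hits = " + countHitsPerSegment(top));          // hits = 3
        System.out.println("top-level filter calls = " + top.filterCalls); // 0
    }
}
```

Because the top-level reader is only used to find its segments, it is never asked to evaluate the filter itself, so in the real library it would have no reason to build its own norms array.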

RE: lucene norms cached twice

Posted by "Cabansag, Ronald-Alvin R" <ro...@cengage.com>.

We create a single instance of IndexReader using IndexReader.open(new MMapDirectory(file)) and a single instance of IndexSearcher using this index reader.

Searches in our application are done in two operations. The first operation gets the hit count: we call QueryWrapperFilter.getDocIdSet(indexReader) to get the DocIdSet and compute the hit count by walking its iterator. This is where we see the ReadOnlyDirectoryReader caching its own copy of the norms.

The second operation is where we actually do the search, using the IndexSearcher.search(new TermQuery(...), filter, collector) method. This is where the sub-reader caches its own copy of the norms.
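
To make the memory effect of these two operations concrete, here is a toy model in plain Java. It is only a sketch: ToyReader, ToySegment and ToyComposite are hypothetical stand-ins, not Lucene classes, and the model assumes (as described above) that the top-level reader and its sub-reader each keep a separate per-field norms cache of one byte per document.

```java
import java.util.HashMap;
import java.util.Map;

public class NormsCacheSketch {

    /** Toy stand-in for a reader. This is NOT Lucene's IndexReader. */
    static abstract class ToyReader {
        // One cached norms array per field, private to this reader instance.
        final Map<String, byte[]> normsCache = new HashMap<String, byte[]>();

        abstract int maxDoc();
        abstract ToyReader[] subReaders(); // null for a leaf (a single segment)

        /** Returns the norms for a field, allocating and caching them on first use. */
        byte[] norms(String field) {
            byte[] cached = normsCache.get(field);
            if (cached == null) {
                cached = new byte[maxDoc()]; // one byte per document, as if read from disk
                normsCache.put(field, cached);
            }
            return cached;
        }
    }

    static final class ToySegment extends ToyReader {
        private final int maxDoc;
        ToySegment(int maxDoc) { this.maxDoc = maxDoc; }
        @Override int maxDoc() { return maxDoc; }
        @Override ToyReader[] subReaders() { return null; }
    }

    static final class ToyComposite extends ToyReader {
        private final ToyReader[] subs;
        ToyComposite(ToyReader... subs) { this.subs = subs; }
        @Override int maxDoc() {
            int total = 0;
            for (ToyReader sub : subs) {
                total += sub.maxDoc();
            }
            return total;
        }
        @Override ToyReader[] subReaders() { return subs; }
    }

    /** Total bytes of cached norms held by a reader and all of its sub-readers. */
    static long cachedBytes(ToyReader reader) {
        long total = 0;
        for (byte[] norms : reader.normsCache.values()) {
            total += norms.length;
        }
        ToyReader[] subs = reader.subReaders();
        if (subs != null) {
            for (ToyReader sub : subs) {
                total += cachedBytes(sub);
            }
        }
        return total;
    }

    /** Operation 1 on the top-level reader, operation 2 on the segment. */
    static long mixedUsageBytes(int maxDoc) {
        ToyComposite top = new ToyComposite(new ToySegment(maxDoc));
        top.norms("body");                 // hit count computed against the top-level reader
        top.subReaders()[0].norms("body"); // search runs against the segment
        return cachedBytes(top);
    }

    /** Both operations on the segment, so the cached array is shared. */
    static long perSegmentUsageBytes(int maxDoc) {
        ToyComposite top = new ToyComposite(new ToySegment(maxDoc));
        ToyReader segment = top.subReaders()[0];
        segment.norms("body"); // hit count, now computed against the segment
        segment.norms("body"); // search reuses the array cached above
        return cachedBytes(top);
    }

    public static void main(String[] args) {
        System.out.println("mixed usage:       " + mixedUsageBytes(1000) + " bytes");      // 2000 bytes
        System.out.println("per-segment usage: " + perSegmentUsageBytes(1000) + " bytes"); // 1000 bytes
    }
}
```

In this model a single field over 1000 documents costs 2000 bytes when the two operations hit different readers, and 1000 bytes when both go through the segment. With many fields and documents the same factor of two applies per field.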

Regards,
Al

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Friday, October 29, 2010 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: lucene norms cached twice

Norms should not normally be loaded twice.

Since 2.9, searching is done at the sub-reader level, and so norms
should never be loaded for the main reader.

But can you describe how you're using Lucene?

Mike

Re: lucene norms cached twice

Posted by Michael McCandless <lu...@mikemccandless.com>.

Norms should not normally be loaded twice.

Since 2.9, searching is done at the sub-reader level, and so norms
should never be loaded for the main reader.

But can you describe how you're using Lucene?

Mike
