Posted to java-user@lucene.apache.org by "m.harig" <m....@gmail.com> on 2009/06/29 12:49:38 UTC

Read large size index

hello all


        I'm building a search application on Lucene. It works fine when my
index is small, but I get a Java heap space error when I use a large index.
I came to know about using Hadoop with Lucene to solve this problem, but I
don't have any idea about Hadoop. I've searched through the net but can't
find a better solution, and I'm tired of searching. I would be very grateful
if someone could tell me how to integrate Lucene with Hadoop. Please, anyone,
help me.
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24251993.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Simon,

Example:

    IndexReader open = IndexReader.open("/tmp/testindex/");
    IndexSearcher searcher = new IndexSearcher(open);
    final String fName = "test";

Is fName a field name, like "summary" or "contents"?

    TopDocs topDocs = searcher.search(new TermQuery(new Term(fName,
"lucene")),
        Integer.MAX_VALUE);
I'm getting a compile error saying that search(Query, Filter) is not
applicable for the arguments (TermQuery, int).

    FieldSelector selector = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        return fName.equals(fieldName) ? FieldSelectorResult.LOAD
            : FieldSelectorResult.NO_LOAD;
      }
    };

    final int totalHits = topDocs.totalHits;
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
      Document doc = searcher.doc(scoreDocs[i].doc, selector);
    }


Could you please explain this code?
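A side note on the FieldSelector snippet above: field names should be compared with String.equals rather than ==, which only tests reference identity and merely happens to work when both strings are interned. A minimal, Lucene-free sketch:

```java
public class StringCompareDemo {
    public static void main(String[] args) {
        final String fName = "test";
        // A field name read from an index may be a distinct String object.
        String fieldName = new String("test");

        // Reference comparison: false, despite equal contents.
        System.out.println(fName == fieldName);       // false
        // Content comparison: true, which is what a FieldSelector needs.
        System.out.println(fName.equals(fieldName));  // true
    }
}
```

Lucene interns field names internally, so == may appear to work in tests and then fail with strings built at runtime.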
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24266289.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Jun 29, 2009 at 6:36 PM, m.harig<m....@gmail.com> wrote:
>
> Thanks Simon ,
>
> Hey there, that makes things easier. :)
>
> ok here are some questions:
>
>>>>Do you iterate over all docs calling hits.doc(i) ?If so do you have to
> load all fields to render your results, if not you should not retrieve
> all of them?
>
>
> Yes, am iterating over all docs by calling hits.doc(i) ,
Do you really need to get all docs? Wouldn't it be enough to fetch
just the top N you want to display, or do you want to display all of
them?
>
>
>
> You use IndexSearcher.search(Query q, ...) which returns a Hits object
> have you tried to use the new search methods returning TopDocs?
>
> Sorry, i didn't , could you please send me a piece of code.
Example:
  IndexReader open = IndexReader.open("/tmp/testindex/");
  IndexSearcher searcher = new IndexSearcher(open);
  final String fName = "test";
  TopDocs topDocs = searcher.search(new TermQuery(new Term(fName, "lucene")),
      Integer.MAX_VALUE);
  // Load only the fName field; skip every other stored field.
  FieldSelector selector = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return fName.equals(fieldName) ? FieldSelectorResult.LOAD
          : FieldSelectorResult.NO_LOAD;
    }
  };

  final int totalHits = topDocs.totalHits;
  ScoreDoc[] scoreDocs = topDocs.scoreDocs;
  for (int i = 0; i < scoreDocs.length; i++) {
    Document doc = searcher.doc(scoreDocs[i].doc, selector);
  }
>
> when you search for pdf and get 30k results you load all the "stored"
> field content into memory once you call IndexSearcher.doc(i) as it
> internally calls IndexReader.document(i, null). This is equivalent to
> a "Load All fields" FieldSelector.
> You can have a closer look at FieldSelector and the new search methods
> which accept them. This is a way to make your retrieval faster and load
> only the fields you really need.
>
>
>



Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Simon , 

Hey there, that makes things easier. :)

ok here are some questions:

>>>Do you iterate over all docs calling hits.doc(i) ?If so do you have to
load all fields to render your results, if not you should not retrieve
all of them?


Yes, I'm iterating over all docs by calling hits.doc(i).



You use IndexSearcher.search(Query q, ...) which returns a Hits object
have you tried to use the new search methods returning TopDocs?

Sorry, I didn't. Could you please send me a piece of code?

when you search for pdf and get 30k results you load all the "stored"
field content into memory once you call IndexSearcher.doc(i) as it
internally calls IndexReader.document(i, null). This is equivalent to
a "Load All fields" FieldSelector.
You can have a closer look at FieldSelector and the new search methods
which accept them. This is a way to make your retrieval faster and load
only the fields you really need.



-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24257547.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
Hey there, that makes things easier. :)

ok here are some questions:

Do you iterate over all docs calling hits.doc(i)? If so, do you have to
load all fields to render your results? If not, you should not retrieve
all of them.
You use IndexSearcher.search(Query q, ...) which returns a Hits object;
have you tried to use the new search methods returning TopDocs?

when you search for pdf and get 30k results you load all the "stored"
field content into memory once you call IndexSearcher.doc(i) as it
internally calls IndexReader.document(i, null). This is equivalent to
a "Load All fields" FieldSelector.
You can have a closer look at FieldSelector and the new search methods
which accept them. This is a way to make your retrieval faster and load
only the fields you really need.






Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks again,

       Did I index my files correctly? Please give me some tips. The following
is the error when I run my keyword. I typed "pdf", that's it, because I've
got around 30,000 files named pdf.


HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it
from fulfilling this request.

exception

javax.servlet.ServletException: Servlet execution threw an exception

root cause

java.lang.OutOfMemoryError: Java heap space
	java.util.Arrays.copyOfRange(Unknown Source)
	java.lang.String.<init>(Unknown Source)
	org.apache.lucene.store.IndexInput.readString(IndexInput.java:113)
	org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:324)
	org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:166)
	org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
	org.apache.lucene.index.IndexReader.document(IndexReader.java:525)
	org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:92)
	org.apache.lucene.search.Hits.doc(Hits.java:167)
	com.npedia.liteSearch.helper.SearchHelper.getResults(SearchHelper.java:103)
	com.npedia.liteSearch.servlet.SearchServlet.doProcess(SearchServlet.java:164)
	com.npedia.liteSearch.servlet.SearchServlet.doGet(SearchServlet.java:39)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

note The full stack trace of the root cause is available in the Apache
Tomcat/6.0.10 logs.

Apache Tomcat/6.0.10
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24254191.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Jun 29, 2009 at 3:07 PM, m.harig<m....@gmail.com> wrote:
>
> Thanks Simon ,
>
>           This is how am indexing my documents ,
>
>
>                indexWriter.addDocument(doc, new StopAnalyzer());
>
>
>                indexWriter.setMergeFactor(10);
>
>                indexWriter.setMaxBufferedDocs(100);
>
>                indexWriter.setMaxMergeDocs(Integer.MAX_VALUE);
>
>                indexWriter.setTermIndexInterval(128);
>
>                indexWriter.setMaxFieldLength(10000);
>
> Do i need improve on this ??
As you said, the problem occurs when you search, right? So for now I
would not worry about indexing too much.
>
>>> Sorry man if you can not provide any details about how you search
>
> What it does mean ?? please let me know...
What I want to know is what your search looks like, for instance:
- do you sort on any field?
- which query do you use (e.g. wildcard searches)?

And again, the source of the error is very important :)

simon



Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Simon ,

           This is how I'm indexing my documents:

                indexWriter.addDocument(doc, new StopAnalyzer());

                indexWriter.setMergeFactor(10);
                indexWriter.setMaxBufferedDocs(100);
                indexWriter.setMaxMergeDocs(Integer.MAX_VALUE);
                indexWriter.setTermIndexInterval(128);
                indexWriter.setMaxFieldLength(10000);

Do I need to improve on this?

>> Sorry man if you can not provide any details about how you search  

What does it mean? Please let me know...
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24253760.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Jun 29, 2009 at 2:55 PM, m.harig<m....@gmail.com> wrote:
>
> Thanks Simon
>
>           I don't run any application on the tomcat , moreover i restarted
> it , am not doing any jobs except searching , we've a 500GB drive , we've
> indexed around 100,000 documents , it gives me around 1GB index . When i
> tried to search pdf i got the heap space error ,

bq.  Am running my application on tomcat 6.0 , i set java heap max as
256MB for tomcat , when i tried to search a query it just showing me heap
space error.
you don't run any app in tomcat?! I'm confused...

Sorry, but if you cannot provide any details about how you search, what
you do, or any stack traces, it's very tough to give you any help or
advice.



Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Simon

           I don't run any other application on Tomcat; moreover, I restarted
it. I'm not doing any jobs except searching. We have a 500GB drive and have
indexed around 100,000 documents, which gives an index of around 1GB. When I
tried to search for "pdf" I got the heap space error.
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24253583.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
Well, with this information I can hardly tell what the cause of the
OOM is. It would be really helpful if you could figure out where it
happens. Do you get the OOM on the first try? I guess you do not do
any indexing in the background?
What is your index "layout"? I mean, what kind of fields do you use
for search, and again, do you do any sorting? Did you try to run your
app with a bit more memory? Are you able to run your app outside of a
servlet container if you cannot find the stacktrace in the logs? Did
you try a profiler or something similar to figure out where it
happens?

simon
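On running the app with a bit more memory: for Tomcat this is usually done through CATALINA_OPTS. The file location and values below are examples only and may differ per installation; -XX:+HeapDumpOnOutOfMemoryError additionally makes the HotSpot JVM write a heap dump on OOM, which a profiler can open to show exactly where the memory went:

```shell
# bin/setenv.sh (create it if absent); sourced by catalina.sh on startup.
# Example sizes only; tune -Xmx to your actual needs.
export CATALINA_OPTS="-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
```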




Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.




Thanks Simon,
         I'm running my application on Tomcat 6.0; I set the Java heap max to
256MB for Tomcat. When I try to search a query it just shows me the heap
space error.


-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24253338.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Jun 29, 2009 at 1:48 PM, m.harig<m....@gmail.com> wrote:
Hey there,
1GB is not a big index though. What happens when you search? Can you
post the stacktrace where the OOM occurs? Do you do any kind of
sorting in your application?

simon
> i've posted a mail at hadoop lucene forum too , but i didn't get any
> response. my index size is 1GB , am using lucene 2.3.0 , java 1.6 , am
> setting 1024 as java max , when i tried to search a query it gives me java
> heap space . please any one help me .



Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.




I've posted a mail at the Hadoop/Lucene forum too, but I didn't get any
response. My index size is 1GB, I'm using Lucene 2.3.0 with Java 1.6, and
I'm setting 1024 as the Java max heap. When I try to search a query it
gives me a Java heap space error. Please, anyone, help me.

-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24252732.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
Hey there,
before going out to use hadoop (the hadoop mailing list would help you
better, I guess) you could provide more information about your
situation. For instance:
- how big is your index
- version of lucene
- which java vm
- how much heap space
- where does the OOM occur

or maybe there is already an issue that is related to you like this
one: https://issues.apache.org/jira/browse/LUCENE-1566

simon




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
On Tue, Jun 30, 2009 at 2:30 PM, m.harig<m....@gmail.com> wrote:
>
>
>
> Hi there,
>
> On Tue, Jun 30, 2009 at 12:41 PM, m.harig<m....@gmail.com> wrote:
>>
>> Thanks Simon ,
>>
>>          Its working now , thanks a lot , i've a doubt
>>
>>       i've got 30,000 pdf files indexed ,  but if i use the code which you
>> sent , returns only 200 results , because am setting   TopDocs topDocs =
>> searcher.search(query,200);  as i said if use Integer.MAX_VALUE , it
>> returns
>> java heap space error , even i can't use 300 ,
> The Integer.MAX_VALUE was my fault. Internally lucene allocates an
> array of the size n (searcher.search(query,n)) even if your query only
> returns 1 document. This causes the OOM. Only get as many results as
> you need!
>
> In turn is iterating and loading of all those documents necessary?
>
> no need to iterate all documents , i set searcher.search(query,10000) , am
> getting the results ,
>
> What is your usecase of lucene where you have to load 30k of
> documents? You have to be aware of that if you load 30k docs you need
> enough memory for them in you JVM. I have no idea how you index and
> what you store in the index but 30k pdf with -Xmx128M is not much :)
>
> is there any way to get the total hits from the index when i search for a
> keyword? i mean i set TopDocCollector collector = new
> TopDocCollector(10000); , so the results will not exceed more than 10k ,
> what am asking is i need to display the total hits from the index , it might
> be more than 10k , like google did ,  Results 1 - 10 of about 51,200 , can
> you please tell me..

TopDocs, returned by several Searcher#search methods, has a field
TopDocs#totalHits that is the total number of hits for a query!
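For illustration, here is a plain-Java sketch of that contract (the `search`
method below is a hypothetical stand-in for Lucene's Searcher#search(query, n),
not the real API): totalHits counts every match, even though only the top n
ScoreDocs are actually materialized.

```java
// Sketch of the TopDocs contract: totalHits counts every match,
// while scoreDocs holds only the top-n results actually requested.
class TopDocsSketch {
    final int totalHits;     // number of all matching documents
    final int[] scoreDocs;   // only the requested top-n doc ids

    TopDocsSketch(int totalHits, int[] scoreDocs) {
        this.totalHits = totalHits;
        this.scoreDocs = scoreDocs;
    }

    // Hypothetical stand-in for Searcher#search(query, n):
    // from all matching doc ids, keep only the first n.
    static TopDocsSketch search(int[] matches, int n) {
        int keep = Math.min(n, matches.length);
        int[] top = new int[keep];
        System.arraycopy(matches, 0, top, 0, keep);
        return new TopDocsSketch(matches.length, top);
    }

    public static void main(String[] args) {
        int[] matches = new int[51200]; // pretend 51,200 docs matched
        TopDocsSketch td = search(matches, 10);
        System.out.println("Results 1 - " + td.scoreDocs.length
                + " of about " + td.totalHits);
    }
}
```

So you can request only 10 documents and still display the full count.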

simon
>
> simon
>
> --
> View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24271025.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>



RE: Read large size index

Posted by Uwe Schindler <uw...@thetaphi.de>.
There was a code snippet in my mail; just fill in your code. I cannot do
everything for you. With some programming experience you should understand
what's going on:

> searcher.search(query, new HitCollector() {
> 	@Override public void collect(int docid, float score) {
> 		// do something with docid
> 	}
> });

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: m.harig [mailto:m.harig@gmail.com]
> Sent: Tuesday, June 30, 2009 2:52 PM
> To: java-user@lucene.apache.org
> Subject: RE: Read large size index
> 
> 
> Thanks Uwe,
> 
>             can you please give me a code snippet , so that i can resolve
> my
> issue , please
> 
> 
> 
> The correct way to iterate over all results is to use a custom
> HitCollector
> (Collector in 2.9) instance. The HitCollector's method collect(docid,
> score)
> is called for every hit. No need to allocate arrays then:
> 
> e.g.:
> searcher.search(query, new HitCollector() {
> 	@Override public void collect(int docid, float score) {
> 		// do something with docid
> 	}
> });
> 
> TopDocsCollector is used to get a relevance-sorted view on the top ranking
> hits. It is not for iterating over the whole results (in full text search,
> nobody would normally do this. E.g. Google does not allow you to go beyond
> page 100). If you want to display the top 10 results you can use
> TopDocCollector(10).
> 
> Uwe
> 
> 
> 
> --
> View this message in context: http://www.nabble.com/Read-large-size-index-
> tp24251993p24271288.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 





RE: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Uwe,

            Could you please give me a code snippet so that I can resolve
my issue?



The correct way to iterate over all results is to use a custom HitCollector
(Collector in 2.9) instance. The HitCollector's method collect(docid, score)
is called for every hit. No need to allocate arrays then:

e.g.:
searcher.search(query, new HitCollector() {
	@Override public void collect(int docid, float score) {
		// do something with docid
	}
});

TopDocsCollector is used to get a relevance-sorted view on the top ranking
hits. It is not for iterating over the whole results (in full text search,
nobody would normally do this. E.g. Google does not allow you to go beyond
page 100). If you want to display the top 10 results you can use
TopDocCollector(10).

Uwe



-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24271288.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




RE: Read large size index

Posted by Uwe Schindler <uw...@thetaphi.de>.
The correct way to iterate over all results is to use a custom HitCollector
(Collector in 2.9) instance. The HitCollector's method collect(docid, score)
is called for every hit. No need to allocate arrays then:

e.g.:
searcher.search(query, new HitCollector() {
	@Override public void collect(int docid, float score) {
		// do something with docid
	}
});

TopDocsCollector is used to get a relevance-sorted view of the top-ranking
hits. It is not meant for iterating over the whole result set (in full-text
search nobody would normally do this; e.g. Google does not let you go beyond
page 100). If you want to display the top 10 results you can use
TopDocCollector(10).
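Here is a plain-Java sketch of why the collector style avoids the OOM (the
HitCollector interface and search method below are stand-ins, not the real
Lucene classes): the searcher pushes each hit to a callback, so no n-sized
result array is ever allocated up front.

```java
// Callback-style collection: the "searcher" streams doc ids to the
// collector instead of buffering them in a pre-sized array, so the
// result count never dictates an up-front allocation.
class CollectorSketch {
    interface HitCollector {                 // stand-in for Lucene's HitCollector
        void collect(int docid, float score);
    }

    // Hypothetical stand-in for Searcher#search(query, collector):
    // pushes every matching doc id to the callback.
    static void search(int[] matchingDocIds, HitCollector collector) {
        for (int doc : matchingDocIds) {
            collector.collect(doc, 1.0f);    // dummy score
        }
    }

    // Example collector: just count the hits.
    static int countHits(int[] matchingDocIds) {
        final int[] count = {0};
        search(matchingDocIds, new HitCollector() {
            public void collect(int docid, float score) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("hits: " + countHits(new int[]{3, 7, 42, 99}));
    }
}
```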

Uwe


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: m.harig [mailto:m.harig@gmail.com]
> Sent: Tuesday, June 30, 2009 2:31 PM
> To: java-user@lucene.apache.org
> Subject: Re: Read large size index
> 
> 
> 
> 
> Hi there,
> 
> On Tue, Jun 30, 2009 at 12:41 PM, m.harig<m....@gmail.com> wrote:
> >
> > Thanks Simon ,
> >
> >          Its working now , thanks a lot , i've a doubt
> >
> >       i've got 30,000 pdf files indexed ,  but if i use the code which
> you
> > sent , returns only 200 results , because am setting   TopDocs topDocs =
> > searcher.search(query,200);  as i said if use Integer.MAX_VALUE , it
> > returns
> > java heap space error , even i can't use 300 ,
> The Integer.MAX_VALUE was my fault. Internally lucene allocates an
> array of the size n (searcher.search(query,n)) even if your query only
> returns 1 document. This causes the OOM. Only get as many results as
> you need!
> 
> In turn is iterating and loading of all those documents necessary?
> 
> no need to iterate all documents , i set searcher.search(query,10000) , am
> getting the results ,
> 
> What is your usecase of lucene where you have to load 30k of
> documents? You have to be aware of that if you load 30k docs you need
> enough memory for them in you JVM. I have no idea how you index and
> what you store in the index but 30k pdf with -Xmx128M is not much :)
> 
> is there any way to get the total hits from the index when i search for a
> keyword? i mean i set TopDocCollector collector = new
> TopDocCollector(10000); , so the results will not exceed more than 10k ,
> what am asking is i need to display the total hits from the index , it
> might
> be more than 10k , like google did ,  Results 1 - 10 of about 51,200 , can
> you please tell me..
> 
> simon
> 
> --
> View this message in context: http://www.nabble.com/Read-large-size-index-
> tp24251993p24271025.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 





Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.


Hi there,

On Tue, Jun 30, 2009 at 12:41 PM, m.harig<m....@gmail.com> wrote:
>
> Thanks Simon ,
>
>          Its working now , thanks a lot , i've a doubt
>
>       i've got 30,000 pdf files indexed ,  but if i use the code which you
> sent , returns only 200 results , because am setting   TopDocs topDocs =
> searcher.search(query,200);  as i said if use Integer.MAX_VALUE , it
> returns
> java heap space error , even i can't use 300 ,
The Integer.MAX_VALUE was my fault. Internally lucene allocates an
array of the size n (searcher.search(query,n)) even if your query only
returns 1 document. This causes the OOM. Only get as many results as
you need!

In turn is iterating and loading of all those documents necessary?

No need to iterate over all documents; I set searcher.search(query, 10000)
and I'm getting the results.

What is your usecase of lucene where you have to load 30k of
documents? You have to be aware of that if you load 30k docs you need
enough memory for them in you JVM. I have no idea how you index and
what you store in the index but 30k pdf with -Xmx128M is not much :)

Is there any way to get the total number of hits from the index when I
search for a keyword? I mean, I set TopDocCollector collector = new
TopDocCollector(10000); so the results will not exceed 10k. What I'm asking
is that I need to display the total hit count from the index, which might
be more than 10k, the way Google does it: Results 1 - 10 of about 51,200.
Can you please tell me?

simon

-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24271025.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Read large size index

Posted by Simon Willnauer <si...@googlemail.com>.
Hi there,

On Tue, Jun 30, 2009 at 12:41 PM, m.harig<m....@gmail.com> wrote:
>
> Thanks Simon ,
>
>          Its working now , thanks a lot , i've a doubt
>
>       i've got 30,000 pdf files indexed ,  but if i use the code which you
> sent , returns only 200 results , because am setting   TopDocs topDocs =
> searcher.search(query,200);  as i said if use Integer.MAX_VALUE , it returns
> java heap space error , even i can't use 300 ,
The Integer.MAX_VALUE was my fault. Internally Lucene allocates an
array of size n (searcher.search(query, n)) even if your query only
returns one document. This causes the OOM. Only fetch as many results as
you need!
In turn, is iterating over and loading all those documents necessary?
What is your use case for Lucene where you have to load 30k
documents? You have to be aware that if you load 30k docs you need
enough memory for them in your JVM. I have no idea how you index and
what you store in the index, but 30k PDFs with -Xmx128M is not much :)
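If you only need one page of results at a time, Google-style paging keeps the
requested window small. A plain-Java sketch of the arithmetic (PAGE_SIZE and
the method names here are assumptions for illustration, not Lucene API):

```java
// Page-window arithmetic for "Results 1 - 10 of about N" style paging:
// request only up to the end of the current page, never Integer.MAX_VALUE.
class PagingSketch {
    static final int PAGE_SIZE = 10;

    // How many top hits to request for page `page` (0-based):
    // just enough to cover that page.
    static int hitsToRequest(int page) {
        return (page + 1) * PAGE_SIZE;
    }

    // Index into scoreDocs of the first result shown on `page`.
    static int firstOnPage(int page) {
        return page * PAGE_SIZE;
    }

    // Total number of pages for `totalHits` results (rounding up).
    static int pageCount(int totalHits) {
        return (totalHits + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    public static void main(String[] args) {
        int page = 2;                        // the third page
        System.out.println("request top " + hitsToRequest(page)
                + " hits, show from index " + firstOnPage(page));
    }
}
```

The searcher is then called with hitsToRequest(page) instead of a huge n, and
totalHits from TopDocs supplies the "of about N" figure.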

simon
>
>    here is my code
>
>
>               IndexReader open = IndexReader.open(indexDir);
>
>                IndexSearcher searcher = new IndexSearcher(open);
>
>                final String fName = "contents";
>
>                QueryParser parser = new QueryParser(fName, new StopAnalyzer());
>                Query query = parser.parse(qryStr);
>
>                TopDocs topDocs = searcher.search(query,200);
>
>                FieldSelector selector = new FieldSelector() {
>                        public FieldSelectorResult accept(String fieldName) {
>                                return fName.equals(fieldName) ? FieldSelectorResult.LOAD
>                                                : FieldSelectorResult.LAZY_LOAD;
>                        }
>
>
>                };
>
>                final int totalHits = topDocs.totalHits;
>                ScoreDoc[] scoreDocs = topDocs.scoreDocs;
>
>                for (int i = 0; i < scoreDocs.length; i++) {
>                        Document doc = searcher.doc(scoreDocs[i].doc, selector);
>
>                        System.out.println(doc.get("title"));
>                        System.out.println(doc.get("path"));
>
>                }
>
>
>
>                searcher.close();
>
>
> ---------------- can you please clear my doubt
> --
> View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24269693.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>



Re: Read large size index

Posted by "m.harig" <m....@gmail.com>.
Thanks Simon,

          It's working now, thanks a lot. I have one remaining doubt:

       I've got 30,000 PDF files indexed, but the code you sent returns
only 200 results, because I'm setting TopDocs topDocs =
searcher.search(query, 200); as I said, if I use Integer.MAX_VALUE it
throws a java heap space error, and I can't even use 300.

    here is my code


		IndexReader open = IndexReader.open(indexDir);

		IndexSearcher searcher = new IndexSearcher(open);

		final String fName = "contents";

		QueryParser parser = new QueryParser(fName, new StopAnalyzer());
		Query query = parser.parse(qryStr);

		TopDocs topDocs = searcher.search(query, 200);

		FieldSelector selector = new FieldSelector() {
			public FieldSelectorResult accept(String fieldName) {
				// compare field names with equals(), not ==
				return fName.equals(fieldName) ? FieldSelectorResult.LOAD
						: FieldSelectorResult.LAZY_LOAD;
			}
		};

		final int totalHits = topDocs.totalHits;
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;

		// iterate only over the fetched top docs (scoreDocs.length);
		// totalHits may be larger than the array
		for (int i = 0; i < scoreDocs.length; i++) {
			Document doc = searcher.doc(scoreDocs[i].doc, selector);

			System.out.println(doc.get("title"));
			System.out.println(doc.get("path"));
		}

		searcher.close();


---------------- Can you please clear my doubt?
-- 
View this message in context: http://www.nabble.com/Read-large-size-index-tp24251993p24269693.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

