Posted to solr-user@lucene.apache.org by Brian Carmalt <bc...@contact.de> on 2007/09/05 17:18:09 UTC

Indexing very large files.

Hello all,

I will apologize up front if this comes twice.

I've been trying to index a 300MB file into Solr 1.2. I keep getting
out-of-memory heap errors.
Even on an empty index with one gig of VM memory it still won't work.
Is it even possible to get Solr to index such large files?
Do I need to write a custom index handler?

Thanks,  Brian


Re: Indexing very large files.

Posted by Mike Klaas <mi...@gmail.com>.
On 7-Sep-07, at 4:47 AM, Brian Carmalt wrote:

> Lance Norskog wrote:
>> Now I'm curious: what is the use case for documents this large?
>

> It is a rare use case, but it could become relevant for us. I was told  
> to explore the possibilities, and that's what I'm doing. :)
>
> Since I haven't heard any suggestions as to how to do this with a  
> stock Solr install, other than increasing VM memory, I'll assume it  
> will have to be done with a custom solution.

Well, have you tried the CSV importer?
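
For illustration, a sketch of what that could look like from the client
side, assuming the stock /update/csv handler and a hypothetical big.csv;
the point is that the request is streamed in small buffers, so the client
never holds the whole file in memory:

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class CsvPost {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint of the CSV update handler; adjust to your install.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update/csv").openConnection();
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(8192);  // stream the body, don't buffer it
        conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        InputStream in = new FileInputStream("big.csv");  // hypothetical file
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);            // copy one buffer at a time
        }
        out.close();
        in.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}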

-Mike

Re: Indexing very large files.

Posted by Walter Underwood <wu...@netflix.com>.
Legal discovery can have requirements like this. --wunder

On 9/7/07 4:47 AM, "Brian Carmalt" <bc...@contact.de> wrote:

> Lance Norskog wrote:
>> Now I'm curious: what is the use case for documents this large?
>> 
>> Thanks,
>> 
>> Lance Norskog
>> 
>> 
>>   
> It is a rare use case, but it could become relevant for us. I was told to
> explore the possibilities, and that's what I'm doing. :)
> 
> Since I haven't heard any suggestions as to how to do this with a stock
> Solr install, other than increasing VM memory, I'll assume it will have
> to be done with a custom solution.
> 
> Thanks for the answers and the interest.
> 
> Brian


Re: Indexing very large files.

Posted by Brian Carmalt <bc...@contact.de>.
Lance Norskog wrote:
> Now I'm curious: what is the use case for documents this large?
>
> Thanks,
>
> Lance Norskog
>
>
>   
It is a rare use case, but it could become relevant for us. I was told to 
explore the possibilities, and that's what I'm doing. :)

Since I haven't heard any suggestions as to how to do this with a stock
Solr install, other than increasing VM memory, I'll assume it will have
to be done with a custom solution.

Thanks for the answers and the interest.

Brian

RE: Indexing very large files.

Posted by Lance Norskog <go...@gmail.com>.
Now I'm curious: what is the use case for documents this large?

Thanks,

Lance Norskog


Re: Indexing very large files.

Posted by Brian Carmalt <bc...@contact.de>.
Hello Thorsten,
I am using Solr 1.2.0. I'll try the svn version out and see if that helps.

Thanks,
Brian

> Which version of Solr are you using?
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup
>
> The trunk version of the XmlUpdateRequestHandler is now based on StAX.
> You may want to try it and see whether it works better.
>
> Please try and report back.
>
> salu2
>   


Re: Indexing very large files.

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-09-06 at 11:26 +0200, Brian Carmalt wrote:
> Hello again,
> 
> I checked out the solr source and built the 1.3-dev version and then I 
> tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.
> 
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2882)
>     at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

It seems that you are reaching the limits because of the StringBuilder.

Did you try raising the memory to the max, like:
java  -Xms1536m -Xmx1788m -jar start.jar

Anyway, you will have to look into 
SolrInputDocument readDoc(XMLStreamReader parser) throws
XMLStreamException {
...
StringBuilder text = new StringBuilder();
...
case XMLStreamConstants.CHARACTERS:
  text.append( parser.getText() );  // buffers the whole document text in memory
  break;
...

The problem is that the "text" object grows bigger than the heap;
maybe invoking garbage collection beforehand will help.

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Indexing very large files.

Posted by Mike Klaas <mi...@gmail.com>.
On 6-Sep-07, at 2:26 AM, Brian Carmalt wrote:

> Hello again,
>
> I checked out the solr source and built the 1.3-dev version and  
> then I tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.

Note that StringBuilder expands capacity by allocating a new buffer
and copying the old one in, so roughly double the memory is needed
during that operation.  The new buffer is probably a good fraction
bigger (traditionally 2x, though growth steps of 1/8 or 1/4 are common
in practice), so simply storing the text for that one document could
require 600-700MB during that expansion operation.  Then you have
overhead for the doc, and all the other Solr memory requirements...
also, perhaps the serialized XML is in memory too, which brings us
back up to close to a gig.
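
A back-of-the-envelope check of those numbers, as a sketch: assume ASCII
input (one input byte becomes one Java char) and a 1/4 growth step, which
is an assumption here rather than the documented StringBuilder policy:

public class GrowthMath {
    public static void main(String[] args) {
        long chars = 300L * 1024 * 1024;  // 300MB of ASCII -> ~300M chars
        long finalBuf = chars * 2;        // a char[] costs 2 bytes per char
        long oldBuf = finalBuf * 4 / 5;   // previous buffer under 1/4 growth
        // The old and new buffers coexist while the copy runs:
        System.out.printf("final buffer: %d MB, peak during last expansion: ~%d MB%n",
                finalBuf >> 20, (finalBuf + oldBuf) >> 20);
    }
}

That prints a ~600MB final buffer and a transient peak over 1GB, which
lines up with the heap sizes failing in this thread.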

Until Solr grows special support for processing huge docs without
copying, just increase your JVM heap while indexing such hugeness.
(Note that other input methods, like CSV, might behave better, but I
haven't examined them to verify.)

-Mike

> java.lang.OutOfMemoryError: Java heap space
>    at java.util.Arrays.copyOf(Arrays.java:2882)
>    at java.lang.AbstractStringBuilder.expandCapacity 
> (AbstractStringBuilder.java:100)
>    at java.lang.AbstractStringBuilder.append 
> (AbstractStringBuilder.java:390)
>    at java.lang.StringBuilder.append(StringBuilder.java:119)
>    at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc 
> (XmlUpdateRequestHandler.java:310)
>    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate 
> (XmlUpdateRequestHandler.java:181)
>    at  
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody 
> (XmlUpdateRequestHandler.java:109)
>    at org.apache.solr.handler.RequestHandlerBase.handleRequest 
> (RequestHandlerBase.java:78)
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:723)
>    at org.apache.solr.servlet.SolrDispatchFilter.execute 
> (SolrDispatchFilter.java:193)
>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter 
> (SolrDispatchFilter.java:161)
>    at  
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter 
> (ApplicationFilterChain.java:235)
>    at org.apache.catalina.core.ApplicationFilterChain.doFilter 
> (ApplicationFilterChain.java:206)
>    at org.apache.catalina.core.StandardWrapperValve.invoke 
> (StandardWrapperValve.java:230)
>    at org.apache.catalina.core.StandardContextValve.invoke 
> (StandardContextValve.java:175)
>    at org.apache.catalina.core.StandardHostValve.invoke 
> (StandardHostValve.java:128)
>    at org.apache.catalina.valves.ErrorReportValve.invoke 
> (ErrorReportValve.java:104)
>    at org.apache.catalina.core.StandardEngineValve.invoke 
> (StandardEngineValve.java:109)
>    at org.apache.catalina.connector.CoyoteAdapter.service 
> (CoyoteAdapter.java:261)
>    at org.apache.coyote.http11.Http11Processor.process 
> (Http11Processor.java:844)
>    at org.apache.coyote.http11.Http11Protocol 
> $Http11ConnectionHandler.process(Http11Protocol.java:581)
>    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run 
> (JIoEndpoint.java:447)
>    at java.lang.Thread.run(Thread.java:619)


Re: Indexing very large files.

Posted by Brian Carmalt <bc...@contact.de>.
Hello again,

I checked out the solr source and built the 1.3-dev version and then I 
tried to index the same file to the new server.
I do get a different exception trace, but the result is the same.

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at 
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:310)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:181)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:109)
    at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:723)
    at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:193)
    at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:161)
    at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
    at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
    at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
    at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
    at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)


Brian
> Which version of Solr are you using?
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup
>
> The trunk version of the XmlUpdateRequestHandler is now based on StAX.
> You may want to try it and see whether it works better.
>
> Please try and report back.
>
> salu2
>   


Re: Indexing very large files.

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-09-06 at 08:55 +0200, Brian Carmalt wrote:
> Hello again,
> 
> I run Solr on Tomcat under Windows and use the Tomcat monitor to start
> the service. I have set the minimum heap
> size to 512MB and the maximum to 1024MB. The system has 2 gigs of
> RAM. The error that I get after sending
> approximately 300MB is:
> 
> java.lang.OutOfMemoryError: Java heap space
>     at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
>     at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
>     at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
>     at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
>     at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
>     at 
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
>     at 
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
>     at 
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
>     at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>     at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
>     at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
>     at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
>     at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>     at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>     at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
>     at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
>     at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>     at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
>     at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>     at java.lang.Thread.run(Thread.java:619)
> 
> After sleeping on the problem, I see that it does not stem directly from
> Solr, but from the module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to
> suggestions and ideas.

Which version of Solr are you using?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup

The trunk version of the XmlUpdateRequestHandler is now based on StAX.
You may want to try it and see whether it works better.
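
For illustration, a minimal StAX read loop, sketched here rather than
taken from the Solr handler: it pulls events straight off the stream
instead of building a tree. Note, though, that any handler which still
appends every CHARACTERS event into one StringBuilder will hit the same
heap wall on a single huge field:

import java.io.FileInputStream;
import javax.xml.stream.*;

public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(args[0]));
        long chars = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                chars += reader.getTextLength();  // count instead of buffering
            }
        }
        reader.close();
        System.out.println("character data: " + chars + " chars");
    }
}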

Please try and report back.

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Indexing very large files.

Posted by Brian Carmalt <bc...@contact.de>.
Hello again,

I run Solr on Tomcat under Windows and use the Tomcat monitor to start
the service. I have set the minimum heap
size to 512MB and the maximum to 1024MB. The system has 2 gigs of
RAM. The error that I get after sending
approximately 300MB is:

java.lang.OutOfMemoryError: Java heap space
    at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
    at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
    at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
    at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
    at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
    at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
    at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
    at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
    at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
    at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

After sleeping on the problem, I see that it does not stem directly from
Solr, but from the module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to
suggestions and ideas.

First: is this doable?
If yes, will I have to modify the code to save the file to disk and then
read it back in order to index it in chunks?
Or can I get it working on a stock Solr install?
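
For what it's worth, a hypothetical sketch of that chunked approach:
split the huge file into fixed-size pieces and post each piece as its
own document through the stock XML update handler. The field names
("id", "text"), the part-naming scheme, and the URL are assumptions
about a local schema, not anything Solr prescribes:

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkIndexer {
    static final int CHUNK_CHARS = 1 << 20;  // ~1M chars per Solr document

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        char[] buf = new char[CHUNK_CHARS];
        int n, part = 0;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
            post("doc1.part" + (part++), new String(buf, 0, n));
        }
        in.close();
    }

    static void post(String id, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        w.write("<add><doc><field name=\"id\">" + id
                + "</field><field name=\"text\">" + escape(body)
                + "</field></doc></add>");
        w.close();
        if (conn.getResponseCode() != 200)
            throw new IOException("update failed: HTTP " + conn.getResponseCode());
    }

    static String escape(String s) {  // minimal XML escaping for the sketch
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }
}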

Thanks,

Brian

Norberto Meijome wrote:
> On Wed, 05 Sep 2007 17:18:09 +0200
> Brian Carmalt <bc...@contact.de> wrote:
>
>   
>> I've been trying to index a 300MB file into Solr 1.2. I keep getting
>> out-of-memory heap errors.
>> Even on an empty index with one gig of VM memory it still won't work.
>>     
>
> Hi Brian,
>
> VM != heap memory.
>
> VM = OS memory
> heap memory = memory made available by the Java VM to the Java process. Heap memory errors are hardly ever an issue of the app itself (other than, of course, with bad programming... but that doesn't seem to be the issue here so far)
>
>
> [betom@ayiin] [Thu Sep  6 14:59:21 2007]
> /usr/home/betom
> $ java -X
> [...]
>     -Xms<size>        set initial Java heap size
>     -Xmx<size>        set maximum Java heap size
>     -Xss<size>        set java thread stack size
> [...]
>
> For example, start Solr as:
> java  -Xms64m -Xmx512m   -jar start.jar
>
> YMMV with respect to the actual values you use.
>
> Good luck,
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> Windows caters to everyone as though they are idiots. UNIX makes no such assumption. 
> It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
>
>   


Re: Indexing very large files.

Posted by Norberto Meijome <fr...@meijome.net>.
On Wed, 05 Sep 2007 17:18:09 +0200
Brian Carmalt <bc...@contact.de> wrote:

> I've been trying to index a 300MB file into Solr 1.2. I keep getting
> out-of-memory heap errors.
> Even on an empty index with one gig of VM memory it still won't work.

Hi Brian,

VM != heap memory.

VM = OS memory
heap memory = memory made available by the Java VM to the Java process. Heap memory errors are hardly ever an issue of the app itself (other than, of course, with bad programming... but that doesn't seem to be the issue here so far)


[betom@ayiin] [Thu Sep  6 14:59:21 2007]
/usr/home/betom
$ java -X
[...]
    -Xms<size>        set initial Java heap size
    -Xmx<size>        set maximum Java heap size
    -Xss<size>        set java thread stack size
[...]

For example, start Solr as:
java  -Xms64m -Xmx512m   -jar start.jar

YMMV with respect to the actual values you use.
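
To confirm what the JVM actually granted at runtime, a quick check (a
sketch using the standard Runtime API):

public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory reflects -Xmx; totalMemory is what is currently committed.
        System.out.printf("max heap: %d MB, total: %d MB, free: %d MB%n",
                rt.maxMemory() >> 20, rt.totalMemory() >> 20,
                rt.freeMemory() >> 20);
    }
}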

Good luck,
B
_________________________
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such assumption. 
It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.

Re: Indexing very large files.

Posted by Brian Carmalt <bc...@contact.de>.
Yonik Seeley wrote:
> On 9/5/07, Brian Carmalt <bc...@contact.de> wrote:
>   
>> I've been trying to index a 300MB file into Solr 1.2. I keep getting
>> out-of-memory heap errors.
>>     
>
> 300MB of what... a single 300MB document?  Or does that file represent
> multiple documents in XML or CSV format?
>
> -Yonik
>   
Hello Yonik,

Thank you for your fast reply.  It is one large document. If it were made up
of smaller docs, I would split it up and index the pieces separately.

Can Solr be made to handle such large docs?

Thanks, Brian

Re: Indexing very large files.

Posted by Yonik Seeley <yo...@apache.org>.
On 9/5/07, Brian Carmalt <bc...@contact.de> wrote:
> I've been trying to index a 300MB file into Solr 1.2. I keep getting
> out-of-memory heap errors.

300MB of what... a single 300MB document?  Or does that file represent
multiple documents in XML or CSV format?

-Yonik