Posted to solr-user@lucene.apache.org by Brian Carmalt <bc...@contact.de> on 2007/09/05 17:18:09 UTC
Indexing very large files.
Hello all,
I will apologize up front if this comes twice.
I've been trying to index a 300MB file to Solr 1.2. I keep getting out of
memory heap errors.
Even on an empty index with one gig of VM memory it still won't work.
Is it even possible to get Solr to index such large files?
Do I need to write a custom index handler?
Thanks, Brian
Re: Indexing very large files.
Posted by Mike Klaas <mi...@gmail.com>.
On 7-Sep-07, at 4:47 AM, Brian Carmalt wrote:
> Lance Norskog schrieb:
>> Now I'm curious: what is the use case for documents this large?
>
> It is a rare use case, but could become relevant for us. I was told
> to explore the possibilities, and that's what I'm doing. :)
>
> Since I haven't heard any suggestions as to how to do this with a
> stock Solr install, other than increase vm memory, I'll assume it
> will have to be done
> with a custom solution.
Well, have you tried the CSV importer?
-Mike
Re: Indexing very large files.
Posted by Walter Underwood <wu...@netflix.com>.
Legal discovery can have requirements like this. --wunder
On 9/7/07 4:47 AM, "Brian Carmalt" <bc...@contact.de> wrote:
> Lance Norskog schrieb:
>> Now I'm curious: what is the use case for documents this large?
>>
>> Thanks,
>>
>> Lance Norskog
>>
>>
>>
> It is a rare use case, but could become relevant for us. I was told to
> explore the possibilities, and that's what I'm doing. :)
>
> Since I haven't heard any suggestions as to how to do this with a stock
> Solr install, other than increase vm memory, I'll assume it will have to
> be done
> with a custom solution.
>
> Thanks for the answers and the interest.
>
> Brian
Re: Indexing very large files.
Posted by Brian Carmalt <bc...@contact.de>.
Lance Norskog schrieb:
> Now I'm curious: what is the use case for documents this large?
>
> Thanks,
>
> Lance Norskog
>
>
>
It is a rare use case, but could become relevant for us. I was told to
explore the possibilities, and that's what I'm doing. :)
Since I haven't heard any suggestions as to how to do this with a stock
Solr install, other than increase vm memory, I'll assume it will have to
be done
with a custom solution.
Thanks for the answers and the interest.
Brian
RE: Indexing very large files.
Posted by Lance Norskog <go...@gmail.com>.
Now I'm curious: what is the use case for documents this large?
Thanks,
Lance Norskog
Re: Indexing very large files.
Posted by Brian Carmalt <bc...@contact.de>.
Moin Thorsten,
I am using Solr 1.2.0. I'll try the svn version out and see if that helps.
Thanks,
Brian
> Which version of Solr do you use?
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup
>
> The trunk version of the XmlUpdateRequestHandler is now based on StAX.
> You may want to try whether that is working better.
>
> Please try and report back.
>
> salu2
>
Re: Indexing very large files.
Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-09-06 at 11:26 +0200, Brian Carmalt wrote:
> Hallo again,
>
> I checked out the solr source and built the 1.3-dev version and then I
> tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.
>
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2882)
> at
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
It seems that you are reaching the limits because of the StringBuilder.
Did you try to raise the mem to the max like:
java -Xms1536m -Xmx1788m -jar start.jar
Anyway you will have to look into:

SolrInputDocument readDoc(XMLStreamReader parser) throws XMLStreamException {
    ...
    StringBuilder text = new StringBuilder();
    ...
    case XMLStreamConstants.CHARACTERS:
        text.append( parser.getText() );
        break;
    ...
The problem is that the "text" object grows bigger than the heap;
maybe invoking garbage collection beforehand will help.
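For comparison, a minimal standalone StAX sketch (not Solr code; the ChunkConsumer interface is a hypothetical stand-in) showing how CHARACTERS events could be handed off incrementally instead of accumulated into one growing buffer:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StreamingFieldReader {
    // Hypothetical callback: receives each chunk of character data as parsed.
    interface ChunkConsumer { void accept(String chunk); }

    static void readText(String xml, ChunkConsumer consumer) throws XMLStreamException {
        XMLStreamReader parser = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.CHARACTERS) {
                // Hand each chunk off immediately; no StringBuilder grows here.
                consumer.accept(parser.getText());
            }
        }
        parser.close();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder total = new StringBuilder();
        readText("<doc><field>hello world</field></doc>", total::append);
        System.out.println(total);
    }
}
```

Whether the rest of the indexing pipeline could consume field values incrementally is a separate question; readDoc as it stands needs the whole field value in memory anyway.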
salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: Indexing very large files.
Posted by Mike Klaas <mi...@gmail.com>.
On 6-Sep-07, at 2:26 AM, Brian Carmalt wrote:
> Hallo again,
>
> I checked out the solr source and built the 1.3-dev version and
> then I tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.
Note that StringBuilder expands capacity by allocating a new buffer
and copying the old one in, so double the memory is needed during
that operation. The new buffer is probably a good fraction bigger
(traditionally 2x, though typical implementations grow by 1/8 or 1/4),
so simply storing the text for that one document could require
600-700MB during that expansion operation. Then you have overhead for
the doc, and all the other Solr memory requirements... also perhaps
the serialized XML is in memory too, which brings us back up to close
to a gig.
Until Solr grows special support for processing huge docs without
copying, just increase your JVM heap while indexing such hugeness.
(Note that other input methods, like CSV, might behave better, but I
haven't examined them to verify.)
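The copy-on-growth behaviour is easy to observe; a small standalone sketch (not Solr code) that records every capacity a StringBuilder passes through while appending:

```java
import java.util.ArrayList;
import java.util.List;

public class SbGrowth {
    // Record every capacity the builder passes through while appending n chars.
    static List<Integer> growthSteps(int n) {
        StringBuilder sb = new StringBuilder(); // default capacity is small (16)
        List<Integer> caps = new ArrayList<>();
        caps.add(sb.capacity());
        for (int i = 0; i < n; i++) {
            sb.append('x');
            int cap = sb.capacity();
            if (cap != caps.get(caps.size() - 1)) {
                // A growth step just happened: a new, larger char[] was
                // allocated and the old contents copied in. Both arrays are
                // live during the copy, so peak memory is roughly old + new.
                caps.add(cap);
            }
        }
        return caps;
    }

    public static void main(String[] args) {
        for (int cap : growthSteps(1 << 20)) {
            System.out.println(cap);
        }
    }
}
```

The exact growth factor varies by JDK, but each step strictly increases the capacity, and the final buffer is at least as large as the text it holds.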
-Mike
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2882)
> at java.lang.AbstractStringBuilder.expandCapacity
> (AbstractStringBuilder.java:100)
> at java.lang.AbstractStringBuilder.append
> (AbstractStringBuilder.java:390)
> at java.lang.StringBuilder.append(StringBuilder.java:119)
> at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc
> (XmlUpdateRequestHandler.java:310)
> at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate
> (XmlUpdateRequestHandler.java:181)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody
> (XmlUpdateRequestHandler.java:109)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest
> (RequestHandlerBase.java:78)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:723)
> at org.apache.solr.servlet.SolrDispatchFilter.execute
> (SolrDispatchFilter.java:193)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter
> (SolrDispatchFilter.java:161)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter
> (ApplicationFilterChain.java:235)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter
> (ApplicationFilterChain.java:206)
> at org.apache.catalina.core.StandardWrapperValve.invoke
> (StandardWrapperValve.java:230)
> at org.apache.catalina.core.StandardContextValve.invoke
> (StandardContextValve.java:175)
> at org.apache.catalina.core.StandardHostValve.invoke
> (StandardHostValve.java:128)
> at org.apache.catalina.valves.ErrorReportValve.invoke
> (ErrorReportValve.java:104)
> at org.apache.catalina.core.StandardEngineValve.invoke
> (StandardEngineValve.java:109)
> at org.apache.catalina.connector.CoyoteAdapter.service
> (CoyoteAdapter.java:261)
> at org.apache.coyote.http11.Http11Processor.process
> (Http11Processor.java:844)
> at org.apache.coyote.http11.Http11Protocol
> $Http11ConnectionHandler.process(Http11Protocol.java:581)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run
> (JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:619)
Re: Indexing very large files.
Posted by Brian Carmalt <bc...@contact.de>.
Hallo again,
I checked out the solr source and built the 1.3-dev version and then I
tried to index the same file to the new server.
I do get a different exception trace, but the result is the same.
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
at java.lang.StringBuilder.append(StringBuilder.java:119)
at
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:310)
at
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:181)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:109)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:723)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:193)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:161)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Brian
> Which version of Solr do you use?
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup
>
> The trunk version of the XmlUpdateRequestHandler is now based on StAX.
> You may want to try whether that is working better.
>
> Please try and report back.
>
> salu2
>
Re: Indexing very large files.
Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-09-06 at 08:55 +0200, Brian Carmalt wrote:
> Hello again,
>
> I run Solr on Tomcat under windows and use the tomcat monitor to start
> the service. I have set the minimum heap
> size to be 512MB and then maximum to be 1024mb. The system has 2 Gigs of
> ram. The error that I get after sending
> approximately 300 MB is:
>
> java.lang.OutOfMemoryError: Java heap space
> at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
> at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
> at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
> at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
> at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
> at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:619)
>
> After sleeping on the problem I see that it does not directly stem from
> Solr, but from the
> module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to suggestions and ideas.
Which version of Solr do you use?
http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup
The trunk version of the XmlUpdateRequestHandler is now based on StAX.
You may want to try whether that is working better.
Please try and report back.
salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: Indexing very large files.
Posted by Brian Carmalt <bc...@contact.de>.
Hello again,
I run Solr on Tomcat under windows and use the tomcat monitor to start
the service. I have set the minimum heap
size to be 512MB and then maximum to be 1024mb. The system has 2 Gigs of
ram. The error that I get after sending
approximately 300 MB is:
java.lang.OutOfMemoryError: Java heap space
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
at
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
at
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
After sleeping on the problem I see that it does not directly stem from
Solr, but from the
module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to suggestions and ideas.
First: is this doable?
If yes, will I have to modify the code to save the file to disk and then
read it back in order to index it in chunks?
Or can I get it working on a stock Solr install?
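If chunking turns out to be the route, a minimal standalone splitter sketch (a hypothetical helper, not part of Solr; note a Reader may legally return fewer chars than requested, so real chunks can come out smaller) could look like:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    // Split a large character stream into pieces of at most chunkChars
    // characters each, so every piece can be indexed as its own document.
    static List<String> split(Reader in, int chunkChars) throws IOException {
        List<String> chunks = new ArrayList<>();
        char[] buf = new char[chunkChars];
        int read;
        while ((read = in.read(buf, 0, chunkChars)) != -1) {
            chunks.add(new String(buf, 0, read));
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a huge file; a real caller would wrap a FileReader.
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 25; i++) big.append('x');
        List<String> parts = split(new StringReader(big.toString()), 10);
        System.out.println(parts.size()); // 10 + 10 + 5 chars -> 3 parts
    }
}
```

Posting each chunk to Solr as a separate document (and searching across them) is left out here; that part depends on the schema.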
Thanks,
Brian
Norberto Meijome schrieb:
> On Wed, 05 Sep 2007 17:18:09 +0200
> Brian Carmalt <bc...@contact.de> wrote:
>
>
>> I've been trying to index a 300MB file to Solr 1.2. I keep getting out of
>> memory heap errors.
>> Even on an empty index with one Gig of vm memory it still won't work.
>>
>
> Hi Brian,
>
> VM != heap memory.
>
> VM = OS memory
> heap memory = memory made available by the Java VM to the Java process. Heap memory errors are hardly ever an issue of the app itself (other than, of course, with bad programming... but that doesn't seem to be the issue here so far)
>
>
> [betom@ayiin] [Thu Sep 6 14:59:21 2007]
> /usr/home/betom
> $ java -X
> [...]
> -Xms<size> set initial Java heap size
> -Xmx<size> set maximum Java heap size
> -Xss<size> set java thread stack size
> [...]
>
> For example, start solr as :
> java -Xms64m -Xmx512m -jar start.jar
>
> YMMV with respect to the actual values you use.
>
> Good luck,
> B
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> Windows caters to everyone as though they are idiots. UNIX makes no such assumption.
> It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
>
>
Re: Indexing very large files.
Posted by Norberto Meijome <fr...@meijome.net>.
On Wed, 05 Sep 2007 17:18:09 +0200
Brian Carmalt <bc...@contact.de> wrote:
> I've been trying to index a 300MB file to Solr 1.2. I keep getting out of
> memory heap errors.
> Even on an empty index with one Gig of vm memory it still won't work.
Hi Brian,
VM != heap memory.
VM = OS memory
heap memory = memory made available by the Java VM to the Java process. Heap memory errors are hardly ever an issue of the app itself (other than, of course, with bad programming... but that doesn't seem to be the issue here so far)
[betom@ayiin] [Thu Sep 6 14:59:21 2007]
/usr/home/betom
$ java -X
[...]
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
-Xss<size> set java thread stack size
[...]
For example, start Solr as:
java -Xms64m -Xmx512m -jar start.jar
YMMV with respect to the actual values you use.
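To confirm the flags actually took effect, a tiny standalone check (not part of Solr) can print what the running JVM sees:

```java
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; totalMemory() is the currently
        // committed heap, which starts near -Xms and grows on demand.
        System.out.println("max heap:   " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("total heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
    }
}
```

Run it with the same -Xms/-Xmx values you pass to the servlet container and compare the numbers.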
Good luck,
B
_________________________
{Beto|Norberto|Numard} Meijome
Windows caters to everyone as though they are idiots. UNIX makes no such assumption.
It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.
I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Indexing very large files.
Posted by Brian Carmalt <bc...@contact.de>.
Yonik Seeley schrieb:
> On 9/5/07, Brian Carmalt <bc...@contact.de> wrote:
>
>> I've been trying to index a 300MB file to Solr 1.2. I keep getting out of
>> memory heap errors.
>>
>
> 300MB of what... a single 300MB document? Or does that file represent
> multiple documents in XML or CSV format?
>
> -Yonik
>
Hello Yonik,
Thank you for your fast reply. It is one large document. If it was made up
of smaller docs, I would split it up and index them separately.
Can Solr be made to handle such large docs?
Thanks, Brian
Re: Indexing very large files.
Posted by Yonik Seeley <yo...@apache.org>.
On 9/5/07, Brian Carmalt <bc...@contact.de> wrote:
> I've been trying to index a 300MB file to Solr 1.2. I keep getting out of
> memory heap errors.
300MB of what... a single 300MB document? Or does that file represent
multiple documents in XML or CSV format?
-Yonik