Posted to dev@lucene.apache.org by "Vitaliy Zhovtyuk (JIRA)" <ji...@apache.org> on 2014/12/01 18:29:17 UTC
[jira] [Updated] (SOLR-3881) frequent OOM in LanguageIdentifierUpdateProcessor

     [ https://issues.apache.org/jira/browse/SOLR-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitaliy Zhovtyuk updated SOLR-3881:
-----------------------------------
    Attachment: SOLR-3881.patch

1. LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() still uses concatFields(), but it shouldn't: that was the whole point of moving it to TikaLanguageIdentifierUpdateProcessor. Instead, LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() should loop over inputFields and call detector.append() (similar to what concatFields() does).
[VZ] LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() now uses the old flow again, with a per-field limit and a maximum total on the detector.
Each field value is appended to the detector.

2. concatFields() and getExpectedSize() should move to TikaLanguageIdentifierUpdateProcessor.
[VZ] Moved to TikaLanguageIdentifierUpdateProcessor. Tests using concatFields() moved to TikaLanguageIdentifierUpdateProcessorFactoryTest.

3. LanguageIdentifierUpdateProcessor.getExpectedSize() still takes a maxAppendSize, which didn't get renamed, but that param could be removed entirely, since maxFieldValueChars is available as a data member.
[VZ] Argument removed.

4. There are a bunch of whitespace changes in LanguageIdentifierUpdateProcessorFactoryTestCase.java - it makes reviewing patches significantly harder when they include changes like this. Your IDE should have settings that make it stop doing this.
[VZ] Whitespace changes removed.

5. There is still some import reordering in TikaLanguageIdentifierUpdateProcessor.java.
[VZ] Fixed.
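
For reference, the append loop discussed in item 1 could look roughly like the sketch below. This is not the code from the patch: TextSink is a hypothetical stand-in for the langdetect Detector (only an append(String) call is assumed), appendCapped() is an illustrative name, and the parameter names simply mirror maxFieldValueChars and maxTotalChars from this thread. The sketch omits any separator between values and truncates by char count, so it may split a surrogate pair.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the patch itself. TextSink stands in for the
// langdetect Detector; only an append(String) method is assumed.
final class CappedAppendSketch {

    /** Minimal stand-in for the detector: anything that accepts appended text. */
    interface TextSink {
        void append(String text);
    }

    /**
     * Appends each field value to the sink, truncating a single value to
     * maxFieldValueChars and stopping once maxTotalChars have been appended.
     * Returns the number of characters actually appended.
     */
    static int appendCapped(List<String> fieldValues, int maxFieldValueChars,
                            int maxTotalChars, TextSink sink) {
        int total = 0;
        for (String value : fieldValues) {
            if (value == null || value.isEmpty()) {
                continue; // nothing to contribute
            }
            int remaining = maxTotalChars - total;
            if (remaining <= 0) {
                break; // total budget exhausted, skip the remaining fields
            }
            int take = Math.min(value.length(), Math.min(maxFieldValueChars, remaining));
            sink.append(value.substring(0, take));
            total += take;
        }
        return total;
    }

    public static void main(String[] args) {
        StringBuilder collected = new StringBuilder();
        int appended = appendCapped(Arrays.asList("aaaaa", "bbbbb", "ccccc"), 3, 7, collected::append);
        System.out.println(appended + " chars: " + collected);
    }
}
```

Because each value is handed to the sink as it is visited, no concatenated buffer of all field values is ever built, which is the allocation that shows up in the stack trace quoted below.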

One last thing:
The total chars default should be its own setting; I was thinking we could make it double the per-value default?
[VZ] Added a default value for maxTotalChars and changed both defaults to 10K, as in com.cybozu.labs.langdetect.Detector.maxLength.
Thanks for adding the total chars default, but you didn't make it double the field value chars default, as I suggested. Not sure if that's better - if the user specifies multiple fields and the first one is the only one that's used to determine the language because it's larger than the total char default, is that an issue? I was thinking that it would be better to visit at least one other field (hence the idea of total = 2 * per-field), but that wouldn't fully address the issue. What do you think?
[VZ] I think in most cases there will be only one field, but since both parameters are optional, we should not restrict the result if only the per-field limit is specified and it is more than 10K.
Updated the total default value to 20K.
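
To make the trade-off above concrete, here is a small sketch that counts how many fields contribute text under a given pair of limits. The class, constants and method are hypothetical names for illustration only, and "10K" / "20K" are taken to mean 10,000 and 20,000 characters.

```java
// Hypothetical sketch for illustration; names and constants are assumptions,
// not identifiers from the patch.
final class TotalCharsDefaultSketch {

    static final int DEFAULT_MAX_FIELD_VALUE_CHARS = 10_000; // assumed meaning of "10K"
    static final int DEFAULT_MAX_TOTAL_CHARS = 20_000;       // assumed meaning of "20K"

    /** Counts the field values that contribute at least one char under the given limits. */
    static int fieldsVisited(int[] fieldLengths, int maxFieldValueChars, int maxTotalChars) {
        int total = 0;
        int visited = 0;
        for (int length : fieldLengths) {
            int remaining = maxTotalChars - total;
            if (remaining <= 0) {
                break; // total budget used up by earlier fields
            }
            int take = Math.min(length, Math.min(maxFieldValueChars, remaining));
            if (take > 0) {
                total += take;
                visited++;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] lengths = {50_000, 3_000}; // one large field followed by a small one
        System.out.println(fieldsVisited(lengths, DEFAULT_MAX_FIELD_VALUE_CHARS, 10_000));
        System.out.println(fieldsVisited(lengths, DEFAULT_MAX_FIELD_VALUE_CHARS, DEFAULT_MAX_TOTAL_CHARS));
    }
}
```

With total equal to the per-field limit, a large first field uses the whole budget and the second field is never read; with total = 2 * per-field, the second field is reached. As noted above, this does not fully address the issue once there are more than two large fields.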


> frequent OOM in LanguageIdentifierUpdateProcessor
> -------------------------------------------------
>
>                 Key: SOLR-3881
>                 URL: https://issues.apache.org/jira/browse/SOLR-3881
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.0
>         Environment: CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=....)
>            Reporter: Rob Tulloh
>             Fix For: 4.9, Trunk
>
>         Attachments: SOLR-3881.patch, SOLR-3881.patch, SOLR-3881.patch, SOLR-3881.patch, SOLR-3881.patch
>
>
> We are seeing frequent failures from Solr causing it to OOM. Here is the stack trace we observe when this happens:
> {noformat}
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2882)
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>         at java.lang.StringBuffer.append(StringBuffer.java:224)
>         at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.concatFields(LanguageIdentifierUpdateProcessor.java:286)
>         at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.process(LanguageIdentifierUpdateProcessor.java:189)
>         at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:171)
>         at org.apache.solr.handler.BinaryUpdateRequestHandler$2.update(BinaryUpdateRequestHandler.java:90)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:140)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:120)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:105)
>         at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
>         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:147)
>         at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:100)
>         at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:47)
>         at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:58)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> {noformat}


