Posted to dev@lucene.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2014/08/22 12:21:12 UTC

[jira] [Commented] (LUCENE-5400) Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization

    [ https://issues.apache.org/jira/browse/LUCENE-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106685#comment-14106685 ] 

ASF subversion and git services commented on LUCENE-5400:
---------------------------------------------------------

Commit 1619730 from [~sarowe@syr.edu] in branch 'dev/trunk'
[ https://svn.apache.org/r1619730 ]

LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules.  The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences.
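
To check the behaviour described in the commit message, one needs a long input that keeps partially matching the email rule. The following is a minimal sketch, not taken from the commit: it builds a run of characters that commonly appear in an email local-part and never emits '@' or whitespace, on the assumption (based on the issue title) that such a run is what triggers the slowdown. The class name LocalPartRun, the alphabet, and the seed parameter are all hypothetical.

```java
import java.util.Random;

/** Builds long inputs made only of characters that commonly appear in an email local-part. */
public final class LocalPartRun {
    // Assumed alphabet for illustration only: lowercase letters, digits, and a few separators.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789._-";

    private LocalPartRun() {}

    /** Returns {@code length} characters containing no '@' and no whitespace. */
    public static String build(int length, long seed) {
        Random random = new Random(seed); // fixed seed keeps the input reproducible
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = build(1_000_000, 42L);
        System.out.println("generated " + text.length() + " chars");
    }
}
```

Timing UAX29URLEmailTokenizer over such a string before and after r1619730 should show whether the buffer change helps for a given input size; the tokenizer call itself is left out here because its constructor differs between Lucene versions.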

> Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5400
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.5
>            Reporter: Chris Geeringh
>            Assignee: Steve Rowe
>
> This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again.
> I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all(?) versions of Solr.
> When a thread hits this issue it uses 100% CPU. Restarting the affected node allows indexing to continue until the issue is hit again. Here is the thread dump:
> http-bio-8080-exec-45 (201)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
>     org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>     org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>     org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>     org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>     org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
>     org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
>     org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
>     org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
>     org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>     org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
>     org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
>     org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
>     org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
>     org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>     org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>     org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>     org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>     org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>     org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>     org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>     org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>     org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
>     org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>     org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
>     org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>     org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     java.lang.Thread.run(Unknown Source)
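
For anyone trying to confirm the same symptom on a live node, a dump like the one above can also be checked from inside the JVM. This is a rough sketch using only the standard Thread.getAllStackTraces() call; the class name StuckThreadFinder and the method threadsIn are hypothetical and are not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Lists live threads that currently have a frame in a given class or package. */
public final class StuckThreadFinder {
    private StuckThreadFinder() {}

    /** Returns the names of threads whose stack contains a class name starting with the prefix. */
    public static List<String> threadsIn(String classNamePrefix) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().startsWith(classNamePrefix)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prefix taken from the top frame of the thread dump in the issue description.
        String prefix = "org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl";
        System.out.println(threadsIn(prefix));
    }
}
```

A thread that shows up in this list across several samples taken seconds apart is a candidate for the busy loop described in the report; a single sample only shows that the tokenizer was running at that instant.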



--
This message was sent by Atlassian JIRA
(v6.2#6252)
