Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2014/01/14 08:40:55 UTC

[jira] [Commented] (SOLR-5440) UAX29URLEmailTokenizer thread hangs on getNextToken - causes cloud to stop accepting updates

    [ https://issues.apache.org/jira/browse/SOLR-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870482#comment-13870482 ] 

Steve Rowe commented on SOLR-5440:
----------------------------------

[~bokkie] privately sent me a document that triggers this problem.  The document consists of an HTML snippet containing a {{<script>}} block, which contains a 3-megabyte-long URL-encoded string in single-quotes, given as a parameter to a javascript function defined elsewhere. (The purpose of the javascript function is to URL-decode the string.)

When I run this text through {{UAX29URLEmailTokenizer}}, it doesn't actually hang; it just tokenizes extremely slowly, consuming less than 100 characters per second on my laptop.  I didn't wait long enough to find out, but I estimate the average scan rate over the entire text is on the order of 200 characters per second, so it would probably take about 4 hours to finish.  (I also sent the same text through {{StandardTokenizer}}, which fortunately does not exhibit the slow tokenization behavior.)  To convince myself that this is not an endless loop of some kind, I fed shorter stretches (hundreds of chars) of URL-encoded text through {{UAX29URLEmailTokenizer}}, and they finished successfully.

I guessed that the problem was with email addresses, so I commented out that part of the {{UAX29URLEmailTokenizer}} specification, and that caused the text to be scanned at the same speed as {{StandardTokenizer}}.

The email rule in {{UAX29URLEmailTokenizer}} is basically the sequence {{<local-part>, "@", <domain>}}. What's happening is that the entire 3-MB-long URL-encoded string matches {{<local-part>}} (the stuff before the "@" in an email address), so for each "%XX" URL-encoded byte, the scanner scans through most of the remaining text looking for an "@" character, then gives up when it reaches the end of the URL-encoded string without finding one, and finally falls back to tokenizing "XX" as {{<ALPHANUM>}}.  The scanner then starts again trying to match an email address over the remainder of the URL-encoded string, and so on.  The total work therefore grows roughly quadratically with the length of the string, so it's not much of a surprise that this is slow.
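
To make the cost concrete, here is a rough Java model of the behavior described above. It is not the JFlex-generated scanner: the class {{EmailBacktrackCost}}, its methods, and the simplified character classes are all made up for illustration, and the domain part of the email rule is not modeled. It only counts how many characters the failed local-part lookahead reads.

```java
/** Toy model (NOT the real JFlex scanner) of the failed email lookahead. */
final class EmailBacktrackCost {

    static boolean isAlnum(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Rough stand-in for EMAILatomText plus the "." between labels (an approximation). */
    static boolean isLocalPartChar(char c) {
        return isAlnum(c) || "!#$%&'*+-/=?^_`{|}~.".indexOf(c) >= 0;
    }

    /** Counts the characters read while looking ahead for "@" from each token start. */
    static long lookaheadChars(String s) {
        long examined = 0;
        final int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (isLocalPartChar(c)) {
                // Try the email rule: read local-part characters, hoping for "@".
                int j = i;
                while (j < n && isLocalPartChar(s.charAt(j))) {
                    j++;
                    examined++;
                }
                if (j < n && s.charAt(j) == '@') {
                    // Crude stand-in for a successful email match (domain not modeled).
                    i = j + 1;
                    continue;
                }
            }
            // No "@" found: fall back to a short token and start over after it.
            if (isAlnum(c)) {
                while (i < n && isAlnum(s.charAt(i))) {
                    i++;
                }
            } else {
                i++; // a lone character such as "%" is skipped
            }
        }
        return examined;
    }

    /** Helper for building test input without requiring Java 11's String.repeat. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int k = 0; k < times; k++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 2000, 4000}) {
            System.out.println(n + " x \"%2F\": " + lookaheadChars(repeat("%2F", n)) + " lookahead reads");
        }
    }
}
```

In this model, n repetitions of "%2F" cost 3n^2 + 2n lookahead reads, so doubling the input roughly quadruples the work.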

[RFC5321|http://tools.ietf.org/search/rfc5321] says:

{noformat}
4.5.3.1.  Size Limits and Minimums

   There are several objects that have required minimum/maximum sizes.
   Every implementation MUST be able to receive objects of at least
   these sizes.  Objects larger than these sizes SHOULD be avoided when
   possible.  However, some Internet mail constructs such as encoded
   X.400 addresses (RFC 2156 [35]) will often require larger objects.
   Clients MAY attempt to transmit these, but MUST be prepared for a
   server to reject them if they cannot be handled by it.  To the
   maximum extent possible, implementation techniques that impose no
   limits on the length of these objects should be used.

   Extensions to SMTP may involve the use of characters that occupy more
   than a single octet each.  This section therefore specifies lengths
   in octets where absolute lengths, rather than character counts, are
   intended.

4.5.3.1.1.  Local-part

   The maximum total length of a user name or other local-part is 64
   octets.
{noformat}

So local-parts of email addresses that are going to work everywhere are effectively limited to 64 bytes.  ([Section 3 of RFC3696|http://tools.ietf.org/html/rfc3696#section-3] says the same thing.)
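
Since the limit is stated in octets rather than characters, a length check has to count encoded bytes. The snippet below is a minimal sketch of such a check; the class and method names are hypothetical, and UTF-8 is assumed as the encoding (the RFC text quoted above does not name one).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: checks the 64-octet local-part limit from RFC 5321. */
final class LocalPartLength {

    static final int MAX_LOCAL_PART_OCTETS = 64;

    /**
     * Returns true if the text before the last "@" is non-empty and at most
     * 64 octets long. The last "@" is used because a quoted local-part may
     * itself contain "@". Assumes UTF-8 when counting octets.
     */
    static boolean localPartFits(String address) {
        int at = address.lastIndexOf('@');
        if (at <= 0) {
            return false; // no "@", or an empty local-part
        }
        String localPart = address.substring(0, at);
        return localPart.getBytes(StandardCharsets.UTF_8).length <= MAX_LOCAL_PART_OCTETS;
    }

    /** Helper for building test input without requiring Java 11's String.repeat. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int k = 0; k < times; k++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(localPartFits("user@example.com"));
        System.out.println(localPartFits(repeat("a", 65) + "@example.com"));
        // 33 two-byte characters are 66 octets, although only 33 chars long.
        System.out.println(localPartFits(repeat("\u00e9", 33) + "@example.com"));
    }
}
```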

One possible solution to this problem is to limit the allowable length of the local-part.  Currently the rule looks like:

{noformat}
EMAILquotedString = [\"] ([\u0001-\u0008\u000B\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007E] | [\\] [\u0000-\u007F])* [\"]
EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
{noformat}

When I try to limit {{EMAILlabel}} as follows, JFlex spends minutes trying to generate the scanner and eventually OOMs, even with env. var. {{ANT_OPT=-Xmx2g}} (I didn't try larger):

{noformat}
EMAILlabel = {EMAILatomText}{1,64} | {EMAILquotedString}
{noformat}

(Note that {{EMAILquotedString}} has the same unlimited length problem - really long quoted ASCII strings could result in the same extremely slow tokenization behavior.)

I think a solution could include a rule matching a fixed-length longer-than-maximum local-part, the action for which sets a lexical state where email addresses aren't allowed, and then pushes back the matched text onto the input stream.  I haven't figured out exactly how to do this yet, though.
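
One way to picture the lexical-state idea is the toy Java model below. It is only a sketch under assumptions, not a JFlex specification and not a worked-out fix: the class {{LexicalStateSketch}} and its simplified character classes are invented for illustration, the domain part of an email is not modeled, and the rule for leaving the no-email state (at the first character that cannot be part of a local-part) is an assumption made here. The model counts every character the scanner reads.

```java
/** Toy model (NOT a JFlex spec) of disabling the email rule after an over-long run. */
final class LexicalStateSketch {

    static final int MAX_LOCAL_PART = 64; // RFC 5321 limit quoted above

    static boolean isAlnum(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /** Rough stand-in for EMAILatomText plus the "." between labels (an approximation). */
    static boolean isLocalPartChar(char c) {
        return isAlnum(c) || "!#$%&'*+-/=?^_`{|}~.".indexOf(c) >= 0;
    }

    /** Counts characters read by the toy scanner, lookahead included. */
    static long charsExamined(String s) {
        long examined = 0;
        boolean emailAllowed = true; // models the lexical state
        final int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (!isLocalPartChar(c)) {
                // Assumption: the run has ended, so email matching is re-enabled.
                emailAllowed = true;
                examined++;
                i++;
                continue;
            }
            if (emailAllowed) {
                // Look ahead at most MAX_LOCAL_PART + 1 characters.
                int limit = Math.min(n, i + MAX_LOCAL_PART + 1);
                int j = i;
                while (j < limit && isLocalPartChar(s.charAt(j))) {
                    j++;
                    examined++;
                }
                if (j - i > MAX_LOCAL_PART) {
                    // Longer than any valid local-part: switch state and
                    // "push back" the text by leaving i unchanged.
                    emailAllowed = false;
                    continue;
                }
                if (j < n && s.charAt(j) == '@') {
                    // Crude stand-in for a successful email match.
                    examined++;
                    i = j + 1;
                    continue;
                }
            }
            // Fallback token: a run of alphanumerics, or one other character.
            if (isAlnum(c)) {
                while (i < n && isAlnum(s.charAt(i))) {
                    i++;
                    examined++;
                }
            } else {
                i++;
                examined++;
            }
        }
        return examined;
    }

    /** Helper for building test input without requiring Java 11's String.repeat. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int k = 0; k < times; k++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 2000, 4000}) {
            System.out.println(n + " x \"%2F\": " + charsExamined(repeat("%2F", n)) + " reads");
        }
    }
}
```

In this model a long URL-encoded run costs one 65-character lookahead plus a single pass over the text, so the work grows linearly with the input length.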

I'd welcome other ideas :)


> UAX29URLEmailTokenizer thread hangs on getNextToken - causes cloud to stop accepting updates
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5440
>                 URL: https://issues.apache.org/jira/browse/SOLR-5440
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.5
>            Reporter: Chris Geeringh
>
> This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again.
> I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all(?) versions of Solr.
> When the thread hits this issue it uses 100% CPU; restarting the node that has the error allows indexing to continue until the issue is hit again. Here is a thread dump:
> http-bio-8080-exec-45 (201)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
>     org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>     org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>     org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>     org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>     org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
>     org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
>     org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
>     org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
>     org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>     org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
>     org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
>     org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
>     org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
>     org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
>     org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
>     org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
>     org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>     org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>     org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>     org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>     org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>     org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>     org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>     org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>     org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>     org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
>     org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>     org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
>     org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>     org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     java.lang.Thread.run(Unknown Source)


