Posted to dev@lucene.apache.org by "Tom Fotherby (JIRA)" <ji...@apache.org> on 2016/07/14 10:50:20 UTC

[jira] [Commented] (LUCENE-7256) PatternReplaceCharFilter can make Lucene hang

    [ https://issues.apache.org/jira/browse/LUCENE-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376756#comment-15376756 ] 

Tom Fotherby commented on LUCENE-7256:
--------------------------------------

> We can't protect against the user being stupid.

Ooooh Burn!  Thanks for the link (http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html) - that's helpful. I'll try to re-write the Regexp to be less "stupid", lol.
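
A minimal sketch of what a less backtracking-prone rewrite might look like, in Java since that is the regex engine PatternReplaceCharFilter runs on. The pattern below is an assumption for illustration, not the pattern from the issue: it only strips URLs that start with http://, https:// or www., so bare domains such as twitter.com/... are left alone. The class and method names are hypothetical. The point is that there is no quantifier nested inside another quantifier, so the engine has no exponential number of ways to split the same text.

```java
import java.util.regex.Pattern;

final class SimpleUrlStripper {

    // Hypothetical simplified pattern (an assumption, not the one from the issue):
    // an explicit "http://", "https://" or "www." prefix, then a single run of
    // non-whitespace characters. No repetition is nested inside another repetition.
    private static final Pattern URL =
        Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    /** Removes every substring matched by the simplified URL pattern. */
    static String stripUrls(String text) {
        return URL.matcher(text).replaceAll("");
    }
}
```

The trade-off is precision: trailing punctuation and balanced parentheses are no longer handled, which is what the nested groups in the original pattern were for.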


> PatternReplaceCharFilter can make Lucene hang
> ---------------------------------------------
>
>                 Key: LUCENE-7256
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7256
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.4.1
>         Environment: alpine linux v3.3
>            Reporter: Tom Fotherby
>            Priority: Minor
>
> I'm using ElasticSearch (v2.2.0, Lucene v5.4.1) and its [Pattern Replace Char Filter|https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html] (Lucene's PatternReplaceCharFilter). I need to filter out URLs from my query text before it is tokenised. But I found that some input strings cause ElasticSearch to "hang" (slowly eating more CPU and memory) until the system crashes.
> ----
> *Example*
> {code}
> // Character filters are used to "tidy up" a string *before* it is tokenized.
> 'char_filter' => [
>     'url_removal_pattern' => [
>         'type'        => 'pattern_replace',
>         'pattern'     => '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))',
>         'replacement' => '',
>     ],
> ],
> {code}
> This filter was working fine for some weeks until suddenly ElasticSearch started crashing. We found someone was trying to do a JavaScript injection attack in our search box.
> I pasted the regex and the attack string into https://regex101.com 
> * Regexp: 
>  * {code}(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’])){code}
> * Test string: 
>  * {code}twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"{code}
> https://regex101.com shows the problem to be "Catastrophic backtracking"
> bq. Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more what this is, please read the following article: [Runaway Regular Expressions|http://www.regular-expressions.info/catastrophic.html].
> It would be great if Lucene could detect "Catastrophic backtracking" and throw an error or return null.
> ----
> As an aside, I created a unit test for our PHP application that uses the same regexp and test string. (PHP can understand the same regexp, even though it's obviously for Java in the ElasticSearch case.) Interestingly, in PHP the regex results in `null`, which is the documented response of [preg_replace|http://php.net/manual/en/function.preg-replace.php] when an error occurs. If PHP can return an error rather than crashing - surely Lucene / Java can too :trollface: ?
> {code}
> namespace app\tests\unit;
> use \yii\codeception\TestCase;
> class TagsControllerTest extends TestCase
> {
>     public function testRegexForURLDetection()
>     {
>         $regex = '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
>         // Test the Catastrophic backtracking problem
>         $testString = "twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"";
>         // This shows the regex is not working for our test string - it gives null but should give 'hello '
>         $this->assertEquals(null, preg_replace("/$regex/", '', "hello $testString"));
>     }
> }
> {code}
> ----
> (I originally [opened a ticket|https://github.com/elastic/elasticsearch/issues/17934] to the ElasticSearch project but got told opening it here would be more appropriate - sorry if I'm wrong)
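
On the request in the description for an error instead of a hang: java.util.regex has no built-in match timeout, but a commonly used workaround is to hand the Matcher a CharSequence whose charAt() checks a deadline. The sketch below assumes that approach. The names TimeLimitedRegex, DeadlineCharSequence and RegexTimeoutException are hypothetical, and nothing in this thread says PatternReplaceCharFilter does this.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

final class TimeLimitedRegex {

    /** Hypothetical exception type signalling that matching ran past its deadline. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** Wraps the input so that every character read first checks the deadline. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Compare by subtraction so the check stays valid if nanoTime() wraps.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex matching exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Like Matcher.replaceAll, but throws RegexTimeoutException after timeoutMillis. */
    static String replaceAll(Pattern pattern, CharSequence input,
                             String replacement, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        return pattern.matcher(new DeadlineCharSequence(input, deadline))
                      .replaceAll(replacement);
    }
}
```

The check costs one System.nanoTime() call per character read, so a real implementation would probably only consult the clock every few thousand reads. The caller also has to decide what a timeout means for the request: reject the input, or pass it through unfiltered.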



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)