Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/01/22 14:14:00 UTC

[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

    [ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748757#comment-16748757 ] 

ASF GitHub Bot commented on NUTCH-2689:
---------------------------------------

sebastian-nagel commented on pull request #432: NUTCH-2689 Speed up urlfilter-regex and urlfilter-automaton
URL: https://github.com/apache/nutch/pull/432
 
 
   - do not extract host and domain name from the URL if not needed
   - speed up regular expressions:
     - use non-capturing groups if possible
     - use (?i) to make the patterns case-insensitive and remove uppercase variants to keep alternations shorter
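
As a rough illustration of the two regular-expression changes listed above, the sketch below compares a suffix rule written with a capturing group and explicit uppercase variants against an equivalent one using (?i) and a non-capturing group. The class name, the patterns, and the example URLs are made up for this illustration and are not taken from the Nutch rule files. Note that the (?i) form also matches mixed-case suffixes such as ".Gif", which the spelled-out alternation does not.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not code from the pull request.
public class SuffixPatternSketch {

    // Before: capturing group, lowercase and uppercase variants listed explicitly.
    static final Pattern VERBOSE =
        Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG)$");

    // After: (?i) makes the match case-insensitive, (?:...) avoids
    // recording a capture group that nothing reads.
    static final Pattern COMPACT =
        Pattern.compile("(?i)\\.(?:gif|jpg|png)$");

    static boolean matchesVerbose(String url) {
        return VERBOSE.matcher(url).find();
    }

    static boolean matchesCompact(String url) {
        return COMPACT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/logo.GIF",
            "http://example.com/photo.jpg",
            "http://example.com/index.html"
        };
        for (String url : urls) {
            System.out.println(url + " verbose=" + matchesVerbose(url)
                + " compact=" + matchesCompact(url));
        }
    }
}
```

Whether the rewritten pattern is actually faster depends on the regex engine and the rule set, so it is worth confirming with the benchmark in the plugin's unit tests.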
 


> Speed up urlfilter-regex and urlfilter-automaton
> ------------------------------------------------
>
>                 Key: NUTCH-2689
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2689
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a benchmark. After experimenting with and benchmarking several modifications, the following changes appear to improve performance significantly:
> - do not extract host and domain name from the URL if not needed (no host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case-insensitive and remove uppercase variants
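
The first point in the list above, skipping host extraction when no rule needs it, could look roughly like the following. This is a minimal sketch under assumptions: the class, the Rule type, and the hostExtractions counter are hypothetical and do not mirror the actual Nutch filter classes. The idea is only that the filter records at load time whether any rule is host-specific and parses the URL only in that case.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Nutch implementation.
public class LazyHostFilterSketch {

    /** A rule: accept or reject, optional host restriction, pattern. */
    static final class Rule {
        final boolean accept;
        final String host; // null means the rule applies to every host
        final Pattern pattern;

        Rule(boolean accept, String host, String regex) {
            this.accept = accept;
            this.host = host;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private boolean hasHostRules = false;

    // Counts host extractions; present only to make the saving visible.
    int hostExtractions = 0;

    void add(boolean accept, String host, String regex) {
        rules.add(new Rule(accept, host, regex));
        if (host != null) {
            hasHostRules = true;
        }
    }

    /** Returns the URL if the first matching rule accepts it, else null. */
    String filter(String url) {
        String host = null;
        if (hasHostRules) {
            // Parse the URL only when at least one rule is host-specific.
            host = URI.create(url).getHost();
            hostExtractions++;
        }
        for (Rule rule : rules) {
            if (rule.host != null && !rule.host.equals(host)) {
                continue; // rule is restricted to a different host
            }
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched
    }
}
```

With a rule set that contains no host-specific rules, filter() never parses the URL, which is the common case the issue description targets.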


