Posted to dev@nutch.apache.org by "Kirby Bohling (JIRA)" <ji...@apache.org> on 2011/07/25 18:34:09 UTC

[jira] [Created] (NUTCH-1068) Automaton performance improvements based on Lucene code base

Automaton performance improvements based on Lucene code base
------------------------------------------------------------

                 Key: NUTCH-1068
                 URL: https://issues.apache.org/jira/browse/NUTCH-1068
             Project: Nutch
          Issue Type: Improvement
            Reporter: Kirby Bohling


The Lucene team maintains a modified copy of the Automaton library, cut down to precisely what they need.  It contains significant performance enhancements.

I am attempting to backport those enhancements and shepherd a patch for the original Automaton library.

The original Lucene code is here:

http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/

The Lucene code is likely slightly faster, as it includes several micro-optimizations that I removed to avoid having to request re-licensing permission.  I would definitely performance test the Lucene RegEx against the patched code.  The Lucene code also uses code points rather than characters, which might make a difference for UTF-16 vs. UTF-32 in obscure cases.  (I believe the Lucene code builds a UTF-32-clean DFA for accuracy and then translates it to a UTF-8 DFA for performance, but I'm not 100% sure.  I don't need or use any of that code, and I am currently only concerned with ASCII DFAs.)
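
To illustrate the code point vs. character distinction, here is a minimal sketch using only the Java standard library.  It is not taken from the patch or from Lucene; the class and method names (CodePointDemo, charCount, codePointCount) are hypothetical and exist only for this example.  It shows that a character outside the Basic Multilingual Plane occupies two Java chars (UTF-16 code units) but is a single code point, which is the case where a char-based automaton and a code-point-based one could behave differently.

```java
public class CodePointDemo {
    /** Number of UTF-16 code units (Java chars) in s. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "nutch";
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP,
        // so UTF-16 encodes it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1D11E));

        // For ASCII input the two counts agree: prints "5 5"
        System.out.println(charCount(ascii) + " " + codePointCount(ascii));
        // For a supplementary character they differ: prints "2 1"
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary));
    }
}
```

For pure ASCII input the two views coincide, so the difference should not matter for ASCII-only DFAs.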

When making heavy use of the NFA-to-DFA transformation, I see a 4x speedup.  From what I can tell, regular expression execution is likely 1.5-2x faster.  The Nutch backend uses this code in a couple of places, so those areas would likely see performance benefits.
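
For readers unfamiliar with the NFA-to-DFA transformation, the following is a minimal sketch of the textbook subset construction in plain Java.  It is not the Automaton library's code, nor the Lucene version, and it omits epsilon transitions and character ranges; all names here (SubsetConstruction, Nfa, Dfa, determinize, endsWithAb) are hypothetical.  It only shows the kind of work the transformation does: each DFA state stands for a set of NFA states, and new sets are discovered from a worklist.

```java
import java.util.*;

public class SubsetConstruction {
    /** NFA without epsilon moves: delta.get(state).get(symbol) is the successor set. */
    static final class Nfa {
        final List<Map<Character, Set<Integer>>> delta = new ArrayList<>();
        final Set<Integer> accepting = new HashSet<>();

        int addState(boolean accept) {
            delta.add(new HashMap<>());
            int id = delta.size() - 1;
            if (accept) accepting.add(id);
            return id;
        }

        void addTransition(int from, char symbol, int to) {
            delta.get(from).computeIfAbsent(symbol, k -> new TreeSet<>()).add(to);
        }
    }

    /** DFA whose start state is always state 0. */
    static final class Dfa {
        final List<Map<Character, Integer>> delta = new ArrayList<>();
        final Set<Integer> accepting = new HashSet<>();

        boolean run(String input) {
            int state = 0;
            for (int i = 0; i < input.length(); i++) {
                Integer next = delta.get(state).get(input.charAt(i));
                if (next == null) return false; // no transition: reject
                state = next;
            }
            return accepting.contains(state);
        }
    }

    /** Builds an equivalent DFA; each DFA state represents a set of NFA states. */
    static Dfa determinize(Nfa nfa, int start) {
        Dfa dfa = new Dfa();
        Map<Set<Integer>, Integer> ids = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();

        Set<Integer> initial = new TreeSet<>(Collections.singleton(start));
        ids.put(initial, 0);
        dfa.delta.add(new HashMap<>());
        if (nfa.accepting.contains(start)) dfa.accepting.add(0);
        work.add(initial);

        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            int currentId = ids.get(current);

            // Union the successors of every NFA state in the current set, per symbol.
            Map<Character, Set<Integer>> moves = new TreeMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.delta.get(s).entrySet()) {
                    moves.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }

            for (Map.Entry<Character, Set<Integer>> e : moves.entrySet()) {
                Set<Integer> target = e.getValue();
                Integer targetId = ids.get(target);
                if (targetId == null) {
                    // First time this set of NFA states is seen: allocate a DFA state.
                    targetId = dfa.delta.size();
                    ids.put(target, targetId);
                    dfa.delta.add(new HashMap<>());
                    for (int s : target) {
                        if (nfa.accepting.contains(s)) { dfa.accepting.add(targetId); break; }
                    }
                    work.add(target);
                }
                dfa.delta.get(currentId).put(e.getKey(), targetId);
            }
        }
        return dfa;
    }

    /** Example NFA for the language (a|b)*ab, with start state 0. */
    static Nfa endsWithAb() {
        Nfa nfa = new Nfa();
        int q0 = nfa.addState(false);
        int q1 = nfa.addState(false);
        int q2 = nfa.addState(true);
        nfa.addTransition(q0, 'a', q0);
        nfa.addTransition(q0, 'b', q0);
        nfa.addTransition(q0, 'a', q1);
        nfa.addTransition(q1, 'b', q2);
        return nfa;
    }
}
```

Calling determinize(endsWithAb(), 0) yields a three-state DFA that accepts "aab" and rejects "aba".  Because the number of state sets can grow exponentially in the worst case, the cost of building and looking up those sets is where an optimized implementation can plausibly gain the most.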

I will attach my backported version for the Automaton 1.11-7 release.  While I don't own any of the copyright, all of the code is licensed under either the BSD license or the ASF 2.0 license, so it is pretty obviously approved for ASF usage.  I am not marking the patch as usable, as I'm not the copyright holder.  If that is an issue, I'll say "yes"; I just don't believe I have any legal standing to do so.  I don't want to create licensing issues for the ASF.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1068) Automaton performance improvements based on Lucene code base

Posted by "Kirby Bohling (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirby Bohling updated NUTCH-1068:
---------------------------------

    Attachment: automaton.diff

I am not the copyright holder, so I don't believe I can grant a license.  This is all based on code used or written by the Lucene project, so I believe it is eligible for inclusion in ASF projects.

        

[jira] [Commented] (NUTCH-1068) Automaton performance improvements based on Lucene code base

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149214#comment-13149214 ] 

Lewis John McGibbney commented on NUTCH-1068:
---------------------------------------------

Hi Kirby. I understand that this was a while ago now, but as no one has commented I thought we may as well keep something moving after our conversation on the dev list. Can you explain how you propose to integrate this into the Nutch code? I am unsure where to start, as it is a GitHub patch. It's also a huge patch. The performance improvements you mention sound appealing, but I really don't know enough just now, especially as I can't use this patch with the trunk code. Thank you.