Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/03/27 14:58:25 UTC

[jira] [Created] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

IndexChecker and ParseChecker choke on IDN's
--------------------------------------------

                 Key: NUTCH-1320
                 URL: https://issues.apache.org/jira/browse/NUTCH-1320
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.5


These handy debug tools do not handle IDNs (internationalized domain names) and throw an NPE:

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

{code}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683 ] 

Hudson commented on NUTCH-1320:
-------------------------------

Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/])
    NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

     Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java

                

[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219 ] 

Hudson commented on NUTCH-1320:
-------------------------------

Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/])
    NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

     Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java

                

[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212 ] 

Lewis John McGibbney commented on NUTCH-1320:
---------------------------------------------

Nice Markus. +1. Is there scope for this to be applied elsewhere, or is parserchecker the only instance (so far) where you've encountered the problem?
                

[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1320:
---------------------------------

    Attachment: NUTCH-1320-1.5-1.patch

Patch for 1.5. URLUtil now has toASCII and toUnicode methods wrapping the java.net.IDN methods. Each takes a URL and returns a normalized one.
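
For readers without the patch at hand, here is a minimal sketch of what such wrappers could look like. It is not the committed URLUtil code: the class name IdnUrlSketch and the helper convertHost are hypothetical, and the URL is reassembled by string concatenation (rather than through the multi-argument java.net.URI constructors, which always quote '%') so that an already percent-encoded path is left untouched. Only the host is passed through java.net.IDN.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical sketch; not the URLUtil code from the attached patch. */
public class IdnUrlSketch {

    /** Returns the URL with its host converted to ASCII (Punycode) form. */
    public static String toASCII(String url) {
        return convertHost(url, true);
    }

    /** Returns the URL with its host converted back to Unicode form. */
    public static String toUnicode(String url) {
        return convertHost(url, false);
    }

    // Parses the URL, converts only the host with java.net.IDN, and
    // reassembles the remaining components exactly as they were given.
    private static String convertHost(String url, boolean ascii) {
        final URL u;
        try {
            u = new URL(url);
        } catch (MalformedURLException e) {
            // Fail loudly instead of returning null to the caller.
            throw new IllegalArgumentException("Malformed URL: " + url, e);
        }
        String host = u.getHost();
        if (host == null || host.isEmpty()) {
            return url; // nothing to convert
        }
        String converted = ascii ? IDN.toASCII(host) : IDN.toUnicode(host);

        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://");
        if (u.getUserInfo() != null) {
            sb.append(u.getUserInfo()).append('@');
        }
        sb.append(converted);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile()); // path plus query, unchanged
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The URL from the issue description; the host is written with
        // Unicode escapes so the source file stays pure ASCII.
        String url = "http://\u4f8b\u5b50.\u6e2c\u8a66/%E9%A6%96%E9%A0%81";
        String ascii = toASCII(url);
        System.out.println(ascii);
        System.out.println(toUnicode(ascii).equals(url));
    }
}
```

A checker tool could call toASCII on its command-line argument before fetching, and an indexing-side normalizer could call toUnicode. Throwing on a malformed URL rather than returning null is a design choice of this sketch, made so that a bad input does not surface later as an NPE.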
                

[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1320:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225 ] 

Markus Jelsma commented on NUTCH-1320:
--------------------------------------

Somewhere down the line IDNs enter the CrawlDB in ASCII, so there is no problem there, but these tools lack the conversion. The filter and normalizer checker tools would also benefit. This also suggests the need for an IDNNormalizer that does toUnicode when indexing; you don't want http://xn--*/ URLs in your index.
                

[jira] [Resolved] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1320.
----------------------------------

    Resolution: Fixed

Committed for 1.6 in rev. 1347755.
Thanks Lewis
                