Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/03/27 14:58:25 UTC
[jira] [Created] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
IndexChecker and ParseChecker choke on IDN's
--------------------------------------------
Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
These handy debug tools do not handle IDNs and throw an NPE:
bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
{code}
Exception in thread "main" java.lang.NullPointerException
at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683 ]
Hudson commented on NUTCH-1320:
-------------------------------
Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)
Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219 ]
Hudson commented on NUTCH-1320:
-------------------------------
Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)
Result = SUCCESS
markus :
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212 ]
Lewis John McGibbney commented on NUTCH-1320:
---------------------------------------------
Nice Markus. +1. Is there scope for this to be applied elsewhere, or is parserchecker the only instance (so far) where you've encountered the problem?
[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1320:
---------------------------------
Attachment: NUTCH-1320-1.5-1.patch
Patch for 1.5. URLUtil now has toASCII and toUnicode methods wrapping the java.net.IDN methods. These take a URL and return a normalized one.
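For readers without the patch at hand, here is a minimal sketch of what such wrappers could look like. It is not the code from NUTCH-1320-1.5-1.patch: the class name IdnUrlSketch and the rebuild helper are made up for illustration, and only java.net.IDN and java.net.URL from the JDK are used. The idea is to convert only the host part and leave the rest of the URL untouched.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration class; not part of Nutch.
public class IdnUrlSketch {

  /** Returns the URL with its host converted to ASCII (Punycode, "xn--...") form. */
  public static String toASCII(String url) throws MalformedURLException {
    URL u = new URL(url);
    // IDN.toASCII throws IllegalArgumentException if the host is not convertible.
    return rebuild(u, IDN.toASCII(u.getHost()));
  }

  /** Returns the URL with its host converted back to Unicode form. */
  public static String toUnicode(String url) throws MalformedURLException {
    URL u = new URL(url);
    return rebuild(u, IDN.toUnicode(u.getHost()));
  }

  // Reassembles the URL around the converted host; every other part is copied as-is.
  private static String rebuild(URL u, String host) {
    StringBuilder sb = new StringBuilder();
    sb.append(u.getProtocol()).append("://");
    if (u.getUserInfo() != null) {
      sb.append(u.getUserInfo()).append('@');
    }
    sb.append(host);
    if (u.getPort() != -1) {
      sb.append(':').append(u.getPort());
    }
    sb.append(u.getFile()); // path plus query string, if any
    if (u.getRef() != null) {
      sb.append('#').append(u.getRef());
    }
    return sb.toString();
  }

  public static void main(String[] args) throws MalformedURLException {
    if (args.length != 1) {
      System.err.println("Usage: IdnUrlSketch <url>");
      return;
    }
    String ascii = toASCII(args[0]);
    System.out.println(ascii);            // host in ASCII form
    System.out.println(toUnicode(ascii)); // host converted back to Unicode
  }
}
```

A checker tool could call something like toASCII on its command-line argument before fetching, so that the URL matches the ASCII form used elsewhere in the crawl.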
[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1320:
---------------------------------
Fix Version/s: 1.6 (was: 1.5)
20120304-push-1.6
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225 ]
Markus Jelsma commented on NUTCH-1320:
--------------------------------------
Somewhere down the line, IDNs enter the CrawlDB in ASCII form, so there is no problem there, but these tools lack the conversion. The filter and normalizer checker tools would also benefit. This also suggests the need for an IDNNormalizer that applies toUnicode when indexing: you don't want http://xn--*/ URLs in your index.
[jira] [Resolved] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1320.
----------------------------------
Resolution: Fixed
Committed for 1.6 in rev. 1347755.
Thanks Lewis