Posted to dev@nutch.apache.org by "Christian Johnsson (JIRA)" <ji...@apache.org> on 2012/08/28 02:49:11 UTC

[jira] [Comment Edited] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442827#comment-13442827 ] 

Christian Johnsson edited comment on NUTCH-1448 at 8/28/12 11:48 AM:
---------------------------------------------------------------------

Is this related to a problem I'm starting to see?
I figure there is some bad input in HBase, but I can't find it :-(

2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
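
One way to hunt for the bad input is to run each row key through a tolerant parser before the parse job sees it. The following is a minimal sketch, not Nutch's TableUtil code. It assumes the row key layout is the reversed host, then ':' and the protocol, optionally ':' and a port, then the path (for example "org.apache.nutch:http/index.html"). The class name ReversedKeyCheck and the method tryUnreverse are hypothetical names made up for this illustration. Under that assumption, a key with no ':' before the path would leave only one element after splitting, which is consistent with an ArrayIndexOutOfBoundsException: 1.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of Nutch.
public class ReversedKeyCheck {

    /**
     * Assumed key layout: reversed.host:protocol[:port][/path].
     * Returns the rebuilt URL, or empty if the key does not fit that layout.
     */
    public static Optional<String> tryUnreverse(String key) {
        if (key == null || key.isEmpty()) {
            return Optional.empty();
        }
        int pathBegin = key.indexOf('/');
        if (pathBegin == -1) {
            pathBegin = key.length();
        }
        String[] parts = key.substring(0, pathBegin).split(":", -1);
        // A key without ':' yields a single element, so reading parts[1]
        // unguarded would throw ArrayIndexOutOfBoundsException: 1.
        if (parts.length < 2 || parts.length > 3
                || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        StringBuilder url = new StringBuilder(parts[1]).append("://");
        // Reverse the host segments: "org.apache.nutch" -> "nutch.apache.org".
        String[] segs = parts[0].split("\\.");
        for (int i = segs.length - 1; i >= 0; i--) {
            url.append(segs[i]);
            if (i > 0) {
                url.append('.');
            }
        }
        if (parts.length == 3) {
            url.append(':').append(parts[2]);
        }
        url.append(key.substring(pathBegin));
        return Optional.of(url.toString());
    }

    public static void main(String[] args) {
        String[] keys = { "org.apache.nutch:http/index.html", "not-a-reversed-key" };
        for (String k : keys) {
            System.out.println(k + " -> " + tryUnreverse(k).orElse("(malformed key)"));
        }
    }
}
```

Logging every key for which tryUnreverse returns empty, instead of letting the mapper fail, might make it easier to locate the offending rows.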
                
      was (Author: mr.johnsson):
    Is there a patch yet to fix this?
I'm starting to get failures because of this sucker :-)
                  
> Redirected urls should be handled more cleanly (more like an outlink url)
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-1448
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1448
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>
> This is specifically for Nutch 2.x. Handling a redirect URL like an outlink is much cleaner because it makes it simpler to trace how new URLs are added to the webpage database. Instant fetching of redirects won't work, but this is a small price to pay. (Note that instant fetching currently does not work at all anyway, because the http.max.redirect property has no effect.) I will attach a patch in the coming days.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira