Posted to dev@nutch.apache.org by "Ferdy Galema (JIRA)" <ji...@apache.org> on 2012/08/13 10:47:37 UTC

[jira] [Created] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Ferdy Galema created NUTCH-1448:
-----------------------------------

             Summary: Redirected urls should be handled more cleanly (more like an outlink url)
                 Key: NUTCH-1448
                 URL: https://issues.apache.org/jira/browse/NUTCH-1448
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Johnsson updated NUTCH-1448:
--------------------------------------

    Comment: was deleted

(was: Is this related to this problem I'm starting to get?
I figure there is some bad input in HBase, but I can't find it :-(

2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task)
    
> Redirected urls should be handled more cleanly (more like an outlink url)
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-1448
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1448
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>
> This is specifically for Nutch 2.x. Handling a redirect URL like an outlink is much cleaner because it makes it simpler to trace how new URLs are added to the webpage database. Instant fetching of redirects won't work, but this is a small price to pay. (Note that instant fetching currently does not work at all, because the http.max.redirect property has no effect.) I will attach a patch in the upcoming days.

[jira] [Closed] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1448.
-------------------------------

    Resolution: Fixed

Committed.
                
[jira] [Comment Edited] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442827#comment-13442827 ] 

Christian Johnsson edited comment on NUTCH-1448 at 8/28/12 11:48 AM:
---------------------------------------------------------------------

Is this related to this problem I'm starting to get?
I figure there is some bad input in HBase, but I can't find it :-(

2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
                
      was (Author: mr.johnsson):
    Is there a patch yet to fix this?
I'm starting to get failures because of this sucker :-)
2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
                  
[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442827#comment-13442827 ] 

Christian Johnsson commented on NUTCH-1448:
-------------------------------------------

Is there a patch yet to fix this?
I'm starting to get failures because of this sucker :-)
2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
                
[jira] [Updated] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1448:
--------------------------------

    Attachment: nutch-1448.txt

Thank you for your interest, Christian. This issue should indeed prevent that problem. (Note that it does not fix corrupt entries already present in the table; you should remove those by hand or resolve them otherwise.)

Here is the patch. I have been running this functionality for quite some time now. If anyone has suggestions, please share them.
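
For readers following the stack trace earlier in this thread: the exception is thrown while a row key is being turned back into a URL, which suggests a key that does not have the expected shape. The sketch below is a simplified, standalone illustration of how such keys could be checked before they are used. It is not the Nutch code and not part of the attached patch. It assumes a reversed-key layout of <reversed host>:<protocol>[:<port>]<path>, and the class and method names (ReversedKeyCheck, safeUnreverse) are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper, not part of Nutch: validates a reversed-URL row key
// before rebuilding the URL, instead of indexing into the split result blindly.
class ReversedKeyCheck {

    // Assumed key layout: <reversed host>:<protocol>[:<port>]<path>
    // e.g. "com.example.www:http/index.html".
    // Returns an empty Optional when the key does not match that layout.
    static Optional<String> safeUnreverse(String key) {
        int pathBegin = key.indexOf('/');
        if (pathBegin == -1) {
            pathBegin = key.length();
        }
        // Everything before the path should split into 2 or 3 non-empty fields.
        String[] parts = key.substring(0, pathBegin).split(":", -1);
        if (parts.length < 2 || parts.length > 3
                || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty(); // malformed key, e.g. no ':' separator at all
        }
        StringBuilder url = new StringBuilder(parts[1]).append("://");
        // Put the host labels back into their normal order.
        String[] labels = parts[0].split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            url.append(labels[i]);
            if (i > 0) {
                url.append('.');
            }
        }
        if (parts.length == 3) {
            url.append(':').append(parts[2]); // optional port
        }
        url.append(key.substring(pathBegin));
        return Optional.of(url.toString());
    }

    public static void main(String[] args) {
        // A well-formed key is rebuilt into a URL.
        System.out.println(safeUnreverse("com.example.www:http/index.html"));
        // A key without the ':' separator is reported as malformed
        // rather than causing an out-of-bounds access.
        System.out.println(safeUnreverse("www.example.com/index.html"));
    }
}
```

A check of this kind could be used to list or skip suspicious rows when cleaning up a table by hand. Which keys are actually corrupt in a given table is something only an inspection of that table can tell.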
                
[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446515#comment-13446515 ] 

Ferdy Galema commented on NUTCH-1448:
-------------------------------------

Yes it does show up as an outlink.

About your issue 1461: that is indeed a duplicate. It fixes the problem once the data is already corrupt, whereas this issue prevents the corruption from happening in the first place. I think 1461 should be closed in favor of this one. (But it is a nice workaround for databases that are already corrupt.)
                
[jira] [Updated] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1448:
--------------------------------

      Description: This is specifically for Nutch 2.x. Handling a redirect URL like an outlink is much cleaner because it makes it simpler to trace how new URLs are added to the webpage database. Instant fetching of redirects won't work, but this is a small price to pay. (Note that instant fetching currently does not work at all, because the http.max.redirect property has no effect.) I will attach a patch in the upcoming days.
    Fix Version/s: 2.1
    
        

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446380#comment-13446380 ] 

Christian Johnsson commented on NUTCH-1448:
-------------------------------------------

Will this affect the outlinks and inlinks in the database too? So that a redirect will show up as an outlink and an inlink?

I did a quick and ugly fix for my problem. https://issues.apache.org/jira/browse/NUTCH-1461 :-)

                
[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446570#comment-13446570 ] 

Christian Johnsson commented on NUTCH-1448:
-------------------------------------------

Thank you for the information.
Yes, 1461 is just a quick and ugly fix so it doesn't crash. Good to have until it's properly fixed. Saves a lot of time searching for corrupt stuff :-)
                
[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446589#comment-13446589 ] 

Hudson commented on NUTCH-1448:
-------------------------------

Integrated in Nutch-nutchgora #334 (See [https://builds.apache.org/job/Nutch-nutchgora/334/])
    NUTCH-1448 Redirected urls should be handled more cleanly (more like an outlink url) (Revision 1379438)

     Result = FAILURE
ferdy : 
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseUtil.java
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java

                