Posted to dev@nutch.apache.org by "Ferdy Galema (Created) (JIRA)" <ji...@apache.org> on 2012/03/16 17:41:40 UTC

[jira] [Created] (NUTCH-1314) Impose a limit on the length of outlink target urls

Impose a limit on the length of outlink target urls
---------------------------------------------------

                 Key: NUTCH-1314
                 URL: https://issues.apache.org/jira/browse/NUTCH-1314
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema
         Attachments: NUTCH-1314.patch

In the past we have encountered situations where crawling specific broken sites produced ridiculously long URLs that stalled tasks. The regex plugins (normalizing/filtering) spent hours processing a single URL, if they did not hang indefinitely.

My suggestion is to limit the outlink target URL length as soon as possible. The limit is configurable and defaults to 3000. This should be long enough for most uses, yet strict enough to make sure the regex plugins do not choke on URLs that are too long. Please see the attached patch for the Nutchgora implementation.
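
For illustration only, here is a minimal sketch of the kind of length check described above, in plain Java with no Nutch dependencies. The class and method names are hypothetical and are not taken from NUTCH-1314.patch; only the default of 3000 comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code from the attached patch.
final class OutlinkLengthLimit {

    // Default from the issue description. The name of the actual
    // configuration property is not given in this thread.
    static final int DEFAULT_MAX_LENGTH = 3000;

    private OutlinkLengthLimit() {
    }

    /**
     * Returns the target URLs whose length does not exceed maxLength.
     * Null entries and overlong URLs are dropped, not shortened.
     */
    static List<String> dropOverlong(List<String> targets, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : targets) {
            // String.length() is a cheap check, so it can safely run
            // before any regex-based normalizer or filter.
            if (url != null && url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

A parser would call something like OutlinkLengthLimit.dropOverlong(targets, maxLength) on the extracted outlink targets before handing them to any regex-based code.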

I'd like to hear what you think about this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256438#comment-13256438 ] 

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

Exactly. Until that merge is properly implemented we can rely on this quickfix.
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256423#comment-13256423 ] 

Julien Nioche commented on NUTCH-1314:
--------------------------------------

I was under the impression that the patch did not remove the URL but substituted it with a shorter version. If the idea is to remove the URL altogether (which makes perfect sense), then yes, it should be a URLFilter instead.
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231368#comment-13231368 ] 

Markus Jelsma commented on NUTCH-1314:
--------------------------------------

I think this should then also work for the Tika parser and the OutlinkExtractor. Parse-html is similar to parse-tika: if no outlinks are obtained by getOutlinks in DOMContentUtils, then the OutlinkExtractor is used.
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256393#comment-13256393 ] 

Julien Nioche commented on NUTCH-1314:
--------------------------------------

What about doing this with a URLNormalizer (and making it the first to be called)?
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231373#comment-13231373 ] 

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

Good one, I overlooked those but they should definitely be treated the same way.
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256398#comment-13256398 ] 

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

I assume you mean a URLFilter? Or do you want to correct the length by cutting off the excess part? I think the URLs should be rejected, because they were probably malformed anyway.
                

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256431#comment-13256431 ] 

Ferdy Galema commented on NUTCH-1314:
-------------------------------------

I understand. I think the problem with implementing it as a URLFilter is that some parts of Nutch run the normalizers first; ParseUtil is one such case. Thus malformed outlinks (which is of course where the majority of new URLs are found) would still be problematic. It does make sense to run the normalizers first, since some URLs still have a chance to be fixed (normalized) before they are filtered out.

Therefore the scope of this issue is to apply a very crude (but effective) filter before normalizing/filtering code is run.
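
The ordering argued for above can be sketched roughly as follows. This is a hypothetical illustration, not the patch: the normalizer and filter are plain java.util.function stand-ins for the real normalizer and filter plugins, and the class name is made up.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical pipeline: length guard first, then normalize, then filter.
final class GuardedUrlPipeline {

    private final int maxLength;
    private final UnaryOperator<String> normalizer; // stand-in for the normalizer chain
    private final Predicate<String> filter;         // stand-in for the filter chain

    GuardedUrlPipeline(int maxLength,
                       UnaryOperator<String> normalizer,
                       Predicate<String> filter) {
        this.maxLength = maxLength;
        this.normalizer = normalizer;
        this.filter = filter;
    }

    /** Returns the normalized URL, or null if the URL is rejected. */
    String process(String url) {
        // Crude guard: an overlong URL is rejected before any
        // potentially expensive normalizing/filtering code sees it.
        if (url == null || url.length() > maxLength) {
            return null;
        }
        // URLs of acceptable length still get a chance to be fixed first.
        String normalized = normalizer.apply(url);
        if (normalized == null || !filter.test(normalized)) {
            return null;
        }
        return normalized;
    }
}
```

With this ordering, URLs within the limit are still normalized before they are filtered, while overlong ones never reach either step.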
                

[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Ferdy Galema (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1314:
--------------------------------

    Attachment: NUTCH-1314.patch
    

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256437#comment-13256437 ] 

Julien Nioche commented on NUTCH-1314:
--------------------------------------

This makes a good case for merging URL filters and normalizers (I think there is a JIRA issue for this): we wouldn't need to worry about whether the normalizer is called first, etc.
                