Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/03/20 20:19:37 UTC

[jira] [Created] (NUTCH-1317) Max content length by MIME-type

Max content length by MIME-type
-------------------------------

                 Key: NUTCH-1317
                 URL: https://issues.apache.org/jira/browse/NUTCH-1317
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.5


The good old http.content.limit directive is not sufficient in large internet crawls. For example, a 5MB PDF file may be parsed without issues, but a 5MB HTML file may time out.
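
A minimal sketch of what a per-MIME-type limit could look like, assuming a simple lookup table with a global fallback. The class name, the method names and the example limits are hypothetical and are not part of Nutch; a negative limit is treated as "no limit", mirroring the `getMaxContent() >= 0` check in protocol-http.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): maximum content length per MIME type. */
class MimeTypeContentLimits {
  private final Map<String, Integer> limits = new HashMap<>();
  private final int defaultLimit;

  /** defaultLimit plays the role of the single global limit; negative means unlimited. */
  MimeTypeContentLimits(int defaultLimit) {
    this.defaultLimit = defaultLimit;
  }

  /** Registers a limit in bytes for one MIME type, e.g. "text/html". */
  void setLimit(String mimeType, int maxBytes) {
    limits.put(normalize(mimeType), maxBytes);
  }

  /** Returns the limit for a Content-Type value, falling back to the default. */
  int limitFor(String contentType) {
    if (contentType == null) {
      return defaultLimit;
    }
    return limits.getOrDefault(normalize(contentType), defaultLimit);
  }

  /** True if content of the given size and type may be fetched or parsed. */
  boolean withinLimit(String contentType, long lengthInBytes) {
    int limit = limitFor(contentType);
    return limit < 0 || lengthInBytes <= limit;
  }

  /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
  private static String normalize(String contentType) {
    int semicolon = contentType.indexOf(';');
    String bare = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    return bare.trim().toLowerCase(Locale.ROOT);
  }
}
```

With a 5MB default and a 512kB entry for text/html, a 5MB PDF would pass while a 5MB HTML page would be rejected before parsing.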

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260390#comment-13260390 ] 

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

The code in HttpResponse is alright as it is.
                
> Max content length by MIME-type
> -------------------------------
>
>                 Key: NUTCH-1317
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1317
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>
> The good old http.content.limit directive is not sufficient in large internet crawls. For example, a 5MB PDF file may be parsed without issues, but a 5MB HTML file may time out.


        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233970#comment-13233970 ] 

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

I am not sure about the root of the problem. We only use Tika for parsing PDF and (X)HTML, and we rely on Boilerpipe. Some HTML pages are quite a thing: full of stuff or endless tables, where you have to press Page Down over a hundred times to reach the bottom. I've not tested all the bad URLs, but I think Tika does the job eventually; if not, I'll file a ticket. Most of the ones I tested work, given enough time.
HTML pages that take more than one second to parse are considered bad; parsing should take less than 50ms on average. The bad ones usually contain too many elements and are large in size.
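
To illustrate the one-second threshold, here is a rough sketch that runs a parse task on a separate thread and gives up once a time budget is exceeded. `TimedParse` and `runWithTimeout` are hypothetical names, and the `Callable` stands in for whatever parser call is actually used; this is not how Nutch or Tika implement it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Nutch class): runs a parse task with a time budget. */
class TimedParse {

  /**
   * Returns the task's result, or null if it did not finish within timeoutMillis.
   * Note that cancel(true) only interrupts the worker thread; a parser that
   * never checks for interruption keeps consuming CPU until it finishes.
   */
  static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<T> future = executor.submit(task);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // page is "bad": too slow to parse
      return null;
    } catch (ExecutionException e) {
      throw new Exception("parse failed", e.getCause());
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A caller could pass 1000 as the budget and log or skip every URL for which null comes back.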
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260329#comment-13260329 ] 

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

Here is one excellent example of a normal HTML page that will certainly slow down your browser and will almost certainly trash a running parsing fetcher with few threads. It eats memory but is only about 3MB in size.

http://www.wohnung-mieten.de/alles.php

In short, do not attempt to parse HTML pages that are over 1000kB or even 500kB.
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260331#comment-13260331 ] 

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

Oh, this is the wrong URL; this one is 18MB large! It seems it did pass our http.content.limit setting! This page does not return a Content-Length HTTP response header. If that's the reason it passed the limit, we need an additional issue to handle those cases.
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260364#comment-13260364 ] 

Lewis John McGibbney commented on NUTCH-1317:
---------------------------------------------

Do you have the original example to compare against?
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233828#comment-13233828 ] 

Lewis John McGibbney commented on NUTCH-1317:
---------------------------------------------

Do you have any indication as to why this is, Markus? Which plugin are you using to parse your HTML?
                

        

[jira] [Updated] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1317:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260363#comment-13260363 ] 

Lewis John McGibbney commented on NUTCH-1317:
---------------------------------------------

Yes, this page almost crashed my browser right enough! What can you rely on if the Content-Length HTTP response header is not present?
                

        

[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260372#comment-13260372 ] 

Markus Jelsma commented on NUTCH-1317:
--------------------------------------

That URL is the original example.

I think it's fairly easy to fix. The code in HttpResponse of protocol-http initializes contentLength to Integer.MAX_VALUE, and it is not being set to maxLength if there is no Content-Length HTTP response header.

{code}
    int contentLength = Integer.MAX_VALUE;    // get content length
    String contentLengthString = headers.get(Response.CONTENT_LENGTH);
    if (contentLengthString != null) {
      contentLengthString = contentLengthString.trim();
      try {
        if (!contentLengthString.isEmpty()) {
          contentLength = Integer.parseInt(contentLengthString);
        }
      } catch (NumberFormatException e) {
        throw new HttpException("bad content length: " + contentLengthString);
      }
    }
    // limit download size; the cap is skipped when getMaxContent() is negative
    if (http.getMaxContent() >= 0 && contentLength > http.getMaxContent()) {
      contentLength = http.getMaxContent();
    }
{code}

I believe this code is alright. But when reading the data, we might want to add a check so that we never exceed maxLength; currently the read loop has no such check of its own and relies on contentLength having been set properly.

{code}
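    // copy the body in chunks until end of stream, or until the next chunk
    // would push the total past contentLength
    for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes)) {
      out.write(bytes, 0, i);
      length += i;
    }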
{code}

If we add a check here, we still allow downloads without a Content-Length HTTP response header, but then read only up to maxLength bytes. I'll check it out and likely open a new issue.
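
A rough sketch of such a check, assuming the limit is enforced purely while reading and independently of any Content-Length header. `BoundedReader` and `readUpTo` are hypothetical names and this is not the actual protocol-http code. Unlike the loop quoted above, which drops the last chunk if it would cross the limit, this version shortens the final read so that exactly maxBytes are kept.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not a Nutch class): reads at most maxBytes from a stream. */
class BoundedReader {

  /**
   * Reads until end of stream or until maxBytes have been read, whichever
   * comes first. A negative maxBytes means no limit.
   */
  static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int total = 0;
    while (maxBytes < 0 || total < maxBytes) {
      // never request more than what is still allowed under the limit
      int want = maxBytes < 0 ? buffer.length : Math.min(buffer.length, maxBytes - total);
      int n = in.read(buffer, 0, want);
      if (n == -1) {
        break; // end of stream reached before the limit
      }
      out.write(buffer, 0, n);
      total += n;
    }
    return out.toByteArray();
  }
}
```

Called with the value of http.content.limit, this would truncate an 18MB page that sends no Content-Length header just like any other response.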
                