Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/10/27 19:58:34 UTC

[jira] [Commented] (NUTCH-1643) Unnecessary fetching with http.content.limit when using protocol-http

    [ https://issues.apache.org/jira/browse/NUTCH-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806424#comment-13806424 ] 

Lewis John McGibbney commented on NUTCH-1643:
---------------------------------------------

[~talat] Thanks for the patch.
Two things:

* I've also looked into the other protocol plugins, and more could be covered by this issue. protocol-httpclient is a definite candidate, as it seems to suffer from the same problem. Do you wish to have a look and see where else improvements can be made? This is of course up to you.
* I am not entirely sure about storing the content as null. My reasoning is as follows: say I have http.content.limit set, but parser.skip.truncated set to false. There would then be no content at all to parse, as the value is null (with an NPE in the back of my mind).

Is there some other solution that strikes a balance here?
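
To make the concern in the second point concrete, here is a rough sketch of one possible balance. It is not the attached patch and not Nutch code: the class ContentLimitSketch, its methods, and the Supplier-based fetch callback are all hypothetical names made up for illustration. It assumes the declared Content-Length is available before the body is read, and that a negative http.content.limit means "no limit". The idea is to skip the body only when parser.skip.truncated is true, and to hand back an empty array rather than null so nothing downstream can hit an NPE.

```java
import java.util.function.Supplier;

/** Hypothetical helper, NOT part of Nutch: it only illustrates the decision. */
final class ContentLimitSketch {

  /** Shared empty body, so callers never receive null. */
  static final byte[] EMPTY = new byte[0];

  /**
   * Decides whether fetching the body would be wasted effort.
   *
   * @param declaredLength Content-Length header value, or -1 if absent
   * @param contentLimit   http.content.limit (assumed: negative = no limit)
   * @param skipTruncated  parser.skip.truncated
   */
  static boolean shouldSkipBody(long declaredLength, int contentLimit,
                                boolean skipTruncated) {
    // Only skip when the parser would discard the truncated content anyway.
    return skipTruncated && contentLimit >= 0 && declaredLength > contentLimit;
  }

  /** Returns the (possibly truncated) body, or an empty array; never null. */
  static byte[] bodyFor(long declaredLength, int contentLimit,
                        boolean skipTruncated,
                        Supplier<byte[]> fetchUpToLimit) {
    if (shouldSkipBody(declaredLength, contentLimit, skipTruncated)) {
      return EMPTY;
    }
    byte[] body = fetchUpToLimit.get();
    return body == null ? EMPTY : body;
  }

  public static void main(String[] args) {
    // Stand-in for a real fetch that stops reading at the limit.
    Supplier<byte[]> fakeFetch = () -> new byte[] {1, 2, 3};
    // Oversized and parser.skip.truncated=true: body is skipped.
    System.out.println(bodyFor(10_000, 100, true, fakeFetch).length);
    // Oversized but parser.skip.truncated=false: truncated body is kept.
    System.out.println(bodyFor(10_000, 100, false, fakeFetch).length);
  }
}
```

With something along these lines, the parser.skip.truncated=false case still gets the truncated content to parse, and the skip case yields zero-length content instead of null. A missing Content-Length header (-1 here) falls through to the normal fetch.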

> Unnecessary fetching with http.content.limit when using protocol-http
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-1643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1643
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: NUTCH-1643.patch
>
>
> In protocol-http, even if I have the http.content.limit value set, the plugin fetches files of all sizes (larger files are fetched up to the limit).
> But when parsing, the parser skips incomplete files (if the parser.skip.truncated configuration is true). It seems like unnecessary effort to partially fetch content larger than the limit if it is not going to be parsed.


