Posted to user@nutch.apache.org by Fabio Ricci <fa...@gmail.com> on 2017/04/16 13:34:34 UTC

Length of downloaded pages

Hi

Is anybody here? ;) I don’t really expect a reply at Easter…

Nutch 1.13 stores incomplete web pages in the dump.

Is there a way to instruct it to download the full content of a page, from <html> to </html>?

Thank you very much in advance

Regards
Fabio

Re: Length of downloaded pages

Posted by Fabio Ricci <fa...@gmail.com>.

Hello Sazedul

Thank you for your hint. I was indeed hoping it would work as you said.
I am using the URL http://amwmg.com/ for tests; it is quite a long page.

Unfortunately, even after changing the value of http.content.limit to -1 in nutch-site.xml, the content is still truncated.
The same happened with a value of 5000000…
(So it seems I have to download the URL contents myself…)
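
For anyone checking the same symptom, here is a minimal Python sketch of the "download it myself and compare" idea. It is only a heuristic, assuming that a page counts as complete when a closing </html> tag is present. The helper names looks_complete and fetch_page are made up for illustration and are not part of Nutch.

```python
import re
import urllib.request

# Hypothetical helper names; nothing here is part of Nutch.
_CLOSING_HTML = re.compile(r"</html\s*>", re.IGNORECASE)


def looks_complete(html_text):
    """Heuristic: True if a closing </html> tag appears anywhere in the text."""
    return _CLOSING_HTML.search(html_text) is not None


def fetch_page(url, timeout=30):
    """Download a URL with the standard library and return the decoded body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Offline check on two small strings (no network access needed):
print(looks_complete("<html><body>full page</body></html>"))  # True
print(looks_complete("<html><body>cut off in the mid"))       # False
```

Running looks_complete on the text of a dumped page and on fetch_page("http://amwmg.com/") would show whether the copy in the dump is the one that lost its ending.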

Thanks a lot anyway!
Fabio

> On 16 Apr 2017, at 15:50, Sazedul Islam <sa...@gmail.com> wrote:
> 
> Yes, there is a way to download webpages without truncating. Just put
> http.content.limit in the nutch-site.xml file with the value -1.
> 
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes. If
>   this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>

Re: Length of downloaded pages

Posted by Sazedul Islam <sa...@gmail.com>.

Yes, there is a way to download webpages without truncating. Just put
http.content.limit in the nutch-site.xml file with the value -1.

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If
  this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
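
To double-check that the setting is really present in the file being edited, a small Python sketch along these lines could help. It assumes the usual layout of <property> elements inside a <configuration> root; the function names read_property and truncation_disabled are hypothetical, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def read_property(root, name):
    """Return the <value> text of the named <property>, or None if it is absent."""
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def truncation_disabled(root):
    """True if http.content.limit is set to a negative value in this file."""
    value = read_property(root, "http.content.limit")
    if value is None:
        return False  # not overridden here, so the built-in default applies
    return int(value) < 0


# Sample configuration text (an assumption about the file layout).
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>"""

root = ET.fromstring(SAMPLE)
print(read_property(root, "http.content.limit"))  # -1
print(truncation_disabled(root))                  # True
```

For a real file, ET.parse(path).getroot() gives the root element, where path points to the nutch-site.xml that the crawl actually reads.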
