Posted to user@nutch.apache.org by Fabio Ricci <fa...@gmail.com> on 2017/04/16 13:34:34 UTC
Length of downloaded pages
Hi
Is anybody here? ;) I don’t really expect anyone on Easter…
Nutch 1.13 stores incomplete web pages in the dump.
Is there a way to instruct it to download the full content of a page, from <html> to </html>?
Thank you very much in advance
Regards
Fabio
Re: Length of downloaded pages
Posted by Fabio Ricci <fa...@gmail.com>.
Hello Sazedul
Thank you for your hint. I was indeed hoping it would work as you said.
I am using the URL http://amwmg.com/ for tests; it is quite a long page.
Unfortunately, even after changing the value of http.content.limit to -1 in nutch-site.xml, the page is still truncated.
The same happened with a value of 5000000…
(So it seems I have to download the URL contents myself…)
Thanks a lot anyway!
Fabio
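A quick way to confirm the truncation outside Nutch is to check whether a dumped page still contains its closing </html> tag. The sketch below is only a heuristic and not part of Nutch; the function names and the 2048-byte tail window are assumptions made for illustration.

```python
import re

# Matches a closing </html> tag, case-insensitively, allowing whitespace before ">".
_CLOSING_HTML = re.compile(rb"</html\s*>", re.IGNORECASE)


def looks_complete(page: bytes, tail_size: int = 2048) -> bool:
    """Return True if a closing </html> tag appears in the last tail_size bytes."""
    return _CLOSING_HTML.search(page[-tail_size:]) is not None


def diagnose(page: bytes, limit: int) -> str:
    """Classify a dumped page given the configured content limit in bytes.

    A negative limit means "no limit", mirroring the http.content.limit description.
    """
    if looks_complete(page):
        return "complete"
    if limit >= 0 and len(page) == limit:
        # The page stops exactly at the limit, which points to truncation.
        return "truncated at limit"
    return "incomplete"


if __name__ == "__main__":
    print(diagnose(b"<html><body>hello</body></html>\n", 65536))
    print(diagnose(b"<html><body>hel", 15))
```

If a page is reported as "incomplete" although the limit is negative, the cut is probably happening somewhere other than the configured content limit.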
> On 16 Apr 2017, at 15:50, Sazedul Islam <sa...@gmail.com> wrote:
>
> Yes, there is a way to download webpages without truncating. Just put
> http.content.limit in the nutch-site.xml file with the value -1.
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes. If
>   this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
>
> On Sun, Apr 16, 2017 at 7:34 PM Fabio Ricci <fa...@gmail.com>
> wrote:
>
>> Hi
>>
>> is there somebody here ;) - Don’t expect you on Easter…
>>
>> NUTCH 1.13 stores in the dump incomplete websites.
>>
>> Is there a way to instruct it to download all content of a website, from
>> <html> to </html> ?
>>
>> Thank you very much in advance
>>
>> Regards
>> Fabio
Re: Length of downloaded pages
Posted by Sazedul Islam <sa...@gmail.com>.
Yes, there is a way to download web pages without truncation. Just set
http.content.limit to -1 in the nutch-site.xml file.
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If
  this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
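To double-check that the override is well-formed and carries the intended value, the file can be parsed with a small script. This is a minimal sketch, assuming the usual layout of <property> elements inside a <configuration> root; read_property and SAMPLE are hypothetical names used only for this example, and the function simply returns the first matching property.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_property(config_xml: str, name: str) -> Optional[str]:
    """Return the <value> text of the first <property> with the given <name>."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical stand-in for the contents of nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_property(SAMPLE, "http.content.limit"))
```

In practice the string would come from the nutch-site.xml that the crawl actually reads; which copy that is depends on how Nutch was installed and started.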
On Sun, Apr 16, 2017 at 7:34 PM Fabio Ricci <fa...@gmail.com>
wrote:
> Hi
>
> is there somebody here ;) - Don’t expect you on Easter…
>
> NUTCH 1.13 stores in the dump incomplete websites.
>
> Is there a way to instruct it to download all content of a website, from
> <html> to </html> ?
>
> Thank you very much in advance
>
> Regards
> Fabio