Posted to user@nutch.apache.org by Amit Sela <am...@infolinks.com> on 2013/11/30 22:43:29 UTC

Anyone managed to execute large scale crawl with Nutch 1.7

I get an OOM exception in the parse phase.
I think it's related to https://issues.apache.org/jira/browse/NUTCH-1640
Has anyone succeeded in fetching and parsing hundreds of thousands or even
millions of pages with Nutch 1.7?

Re: Anyone managed to execute large scale crawl with Nutch 1.7

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Amit,

I run whole-internet crawls with Nutch 2.x. The parse phase is always a
problem. I found that base64 image data was embedded in URLs, which caused
some OOM exceptions. Maybe you have a similar issue. Can you share the
parse log? Maybe we can work out what is going on.

Talat
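
A minimal sketch of the kind of check described above: dropping URLs that embed a base64 data URI, or that are simply very long, before they reach the parser. This is illustrative Python, not Nutch code. The names `MAX_URL_LENGTH`, `DATA_URI`, `is_suspect_url` and `filter_urls` and the 2048-character limit are assumptions made for the example. In Nutch itself a rule like this would more naturally live in a URL filter.

```python
import re

# Hypothetical length limit for this sketch; it is not a Nutch setting.
MAX_URL_LENGTH = 2048

# Matches an inline base64 data URI anywhere in the string,
# e.g. "data:image/png;base64,...." (optional parameters such as charset allowed).
DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+(;[\w=.+-]+)*;base64,", re.IGNORECASE)


def is_suspect_url(url, max_length=MAX_URL_LENGTH):
    """Return True if the URL is overly long or embeds a base64 data URI."""
    return len(url) > max_length or DATA_URI.search(url) is not None


def filter_urls(urls):
    """Keep only the URLs that do not look like embedded binary payloads."""
    return [u for u in urls if not is_suspect_url(u)]


if __name__ == "__main__":
    sample = [
        "http://example.com/index.html",
        "http://example.com/img/data:image/gif;base64,R0lGODlhAQABAAAAACw=",
    ]
    for url in filter_urls(sample):
        print(url)
```

Logging the rejected URLs alongside the parse log would help confirm whether this is actually the cause of the OOM in a given crawl.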

On 01-12-2013 22:47, Amit Sela wrote:
> I'm using a long-running production cluster, so I don't think the machine
> configuration is the issue, and if it were, I'd expect it in the fetch phase,
> wouldn't you?


Re: Anyone managed to execute large scale crawl with Nutch 1.7

Posted by Amit Sela <am...@infolinks.com>.
I'm using a long-running production cluster, so I don't think the machine
configuration is the issue, and if it were, I'd expect it in the fetch phase,
wouldn't you?
On Dec 1, 2013 9:41 PM, "S.L" <si...@gmail.com> wrote:

> I was able to crawl a couple of hundred thousand URLs in local
> mode and did not get any OOM exceptions. What machine configuration do
> you use?

Re: Anyone managed to execute large scale crawl with Nutch 1.7

Posted by "S.L" <si...@gmail.com>.
I was able to crawl a couple of hundred thousand URLs in local
mode and did not get any OOM exceptions. What machine configuration do
you use?


On Sat, Nov 30, 2013 at 4:43 PM, Amit Sela <am...@infolinks.com> wrote:

> I get an OOM exception in the parse phase.
> I think it's related to https://issues.apache.org/jira/browse/NUTCH-1640
> Has anyone succeeded in fetching and parsing hundreds of thousands or even
> millions of pages with Nutch 1.7?
>