Posted to user@nutch.apache.org by John Mendenhall <jo...@surfutopia.net> on 2008/01/19 23:40:21 UTC

nutch 0.9, multiple nodes, not fetching topN links to fetch

Hello,

I am running nutch 0.9 currently.
I am running on 4 nodes, one is the
master, in addition to being a slave.

I have injected 100k urls into nutch.
All urls are on the same host.

I am running a generate/fetch/update
cycle with topN set at 100k.

However, after each cycle, it only
fetches between 2588 and 2914 urls
each time.  I have run this over 8
times, all with the same result.

I have tried using nutch fetch and
nutch fetch2.

My hypothesis is that this is due to all
urls being on the same host (www.example.com/some/path).

Do I need to set the fetcher.threads.per.host
to something higher than the default of 2?

Is there something in the logs I should
look for to determine the exact cause of
this problem?

Thank you in advance for any assistance
that can be provided.

If you need any additional information,
please let me know and I'll send it.

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Siddhartha Reddy <si...@grok.in>.
Although only one of the machines will be used for the fetch task (because
all your urls are from a single host), the other tasks do not have any such
requirement and can run on multiple machines. So running in distributed mode
might still benefit you.

To 'turn off' the 3 slaves, you can simply remove them from the conf/slaves
file. You might also want to change the other dfs parameters
accordingly. I would suggest that you turn off dfs entirely in this case
by setting 'fs.default.name' to 'file:///' and 'mapred.job.tracker' to
'local'.
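
For reference, a minimal sketch of those two overrides in conf/hadoop-site.xml
(assuming the Hadoop version bundled with Nutch 0.9; adjust to your setup):

-----
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <!-- use the local filesystem instead of HDFS -->
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <!-- run map/reduce jobs in-process, without a jobtracker -->
</property>
-----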

Best,
Siddhartha

On Jan 27, 2008 5:32 AM, John Mendenhall <jo...@surfutopia.net> wrote:

> Andrzej Bialecki,
>
> > >All hosts are the same.  Every one of them.
> > >
> > >If there is no way to split them up, this seems to
> > >imply the distributed nature of nutch is lost on
> > >attempting to build an index for a single large
> > >site.  Please correct me if I am wrong with this
> > >presumption.
> >
> > It doesn't matter whether you use a distributed crawl or not - you still
> > are expected to crawl politely, meaning that you should not exceed a
> > certain rate of requests / sec to any given host. Since all your urls
> > come from the same host, then no matter how many machines you throw at
> > it, you will still be crawling at a rate of 1 page / 5 seconds (or
> > whatever you set in the nutch-site.xml). So, a single machine can manage
> > this just fine.
>
> Currently, I have 4 machines running nutch, one master/slave,
> and 3 pure slaves.  What is the best procedure for turning off
> the 3 slaves?
>
> Should I go back to a "local" setup only, without the overhead
> of hadoop dfs?
>
> What is the best recommendation?
>
> Thanks!
>
> JohnM
>
> --
> john mendenhall
> john@surfutopia.net
> surf utopia
> internet services
>



-- 
http://sids.in
"If you are not having fun, you are not doing it right."

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
Andrzej Bialecki,

> >All hosts are the same.  Every one of them.
> >
> >If there is no way to split them up, this seems to
> >imply the distributed nature of nutch is lost on
> >attempting to build an index for a single large
> >site.  Please correct me if I am wrong with this
> >presumption.
> 
> It doesn't matter whether you use a distributed crawl or not - you still 
> are expected to crawl politely, meaning that you should not exceed a
> certain rate of requests / sec to any given host. Since all your urls
> come from the same host, then no matter how many machines you throw at
> it, you will still be crawling at a rate of 1 page / 5 seconds (or 
> whatever you set in the nutch-site.xml). So, a single machine can manage 
> this just fine.

Currently, I have 4 machines running nutch, one master/slave,
and 3 pure slaves.  What is the best procedure for turning off
the 3 slaves?

Should I go back to a "local" setup only, without the overhead
of hadoop dfs?

What is the best recommendation?

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
On Fri, 25 Jan 2008, Dennis Kubes wrote:

> Yes you would need to run parsing after fetching and before updatedb.

Thanks!

JohnM


> John Mendenhall wrote:
> >On Fri, 25 Jan 2008, Dennis Kubes wrote:
> >
> >>>Is the recommendation to run fetcher in parsing mode?
> >>>If so, when should the parse be done?  After the updatedb?
> >>>Before the indexing?
> >>You would run the parsing after the fetch process.  But this way the 
> >>fetch would complete the download and if the parsing failed you would 
> >>still have the page content and be able to try again without refetching.
> >
> >To clarify, run the parsing after the fetch process
> >and before the updatedb process, correct?
> >
> >Thanks!
> >
> >JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Dennis Kubes <ku...@apache.org>.
Yes you would need to run parsing after fetching and before updatedb.
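
A rough sketch of one such cycle with the Nutch 0.9 command-line tools
(paths and topN are illustrative; check bin/nutch usage for your build):

-----
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment -noParsing
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment
-----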

John Mendenhall wrote:
> On Fri, 25 Jan 2008, Dennis Kubes wrote:
> 
>>> Is the recommendation to run fetcher in parsing mode?
>>> If so, when should the parse be done?  After the updatedb?
>>> Before the indexing?
>> You would run the parsing after the fetch process.  But this way the 
>> fetch would complete the download and if the parsing failed you would 
>> still have the page content and be able to try again without refetching.
> 
> To clarify, run the parsing after the fetch process
> and before the updatedb process, correct?
> 
> Thanks!
> 
> JohnM
> 

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
On Fri, 25 Jan 2008, Dennis Kubes wrote:

> >Is the recommendation to run fetcher in parsing mode?
> >If so, when should the parse be done?  After the updatedb?
> >Before the indexing?
> 
> You would run the parsing after the fetch process.  But this way the 
> fetch would complete the download and if the parsing failed you would 
> still have the page content and be able to try again without refetching.

To clarify, run the parsing after the fetch process
and before the updatedb process, correct?

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Siddhartha Reddy <si...@grok.in>.
Do you have the Java heap space options set in the 'mapred.child.java.opts'
property (in conf/hadoop-site.xml)? For a machine with 1gb ram and 1gb swap
space, I set this to '-Xms1024m -Xmx2048m'.
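
For example, a sketch of that property in conf/hadoop-site.xml (the value
is illustrative; size it to the RAM actually available on each node):

-----
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <!-- heap for each child task JVM; the shipped default is much smaller -->
</property>
-----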

Best,
Siddhartha

On Jan 31, 2008 3:23 AM, John Mendenhall <jo...@surfutopia.net> wrote:

> > >>>The one task crawls about 3% of my topN and stops
> > >>>eventually with java.lang.OutOfMemoryError: Java heap space
> > >>>errors.
> > >>Are you running Fetcher in parsing mode? Try to use the -noParsing
> > >>option, and then parse the content in a separate step.
>
> I am now running generate/fetch/parse/updatedb.
> The fetch process still only gets about 3%-4% of
> the URLs in the topN of the generate.
> The fetch process logs similar messages as before:
>
> -----
> fetch of http://www.example.com/public/page.asp/85491 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/16154 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/20208 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/15411 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/178293 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/843060 failed with:
> java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/967264 failed with:
> java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> fetcher caught:java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/97401 failed with:
> java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> fetcher caught:java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/1585146 failed with:
> java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> fetcher caught:java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/11 failed with:
> java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> fetcher caught:java.lang.OutOfMemoryError: Java heap space
> -----
>
> The first few entries are just fetch of X failed with: Y
> After a few of these, it changes to a set of 3 error messages
> like 'fetcher caught: java.lang... ; java.lang... ; fetch of X
> failed with: java.lang...'.
>
> I am not seeing any errors in the parse process.
>
> How do I hunt down the java heap space error
> further?  This only occurs in the fetch process.
> Do I have too many threads?
>
> I have it set to 24 threads, 32 max on a single
> host.
>
> I have the std memory option on the java runs.
> Every java process has the -Xmx1000m option.
> Should this be increased?
>
> How do you deal with slaves that have different
> amounts of memory.  I have some with 1.5gb ram,
> and others with 4gb ram.
>
> Sorry for all the questions.  The fetch issue is
> my current wall I am trying to overcome.
>
> Should this be debugged in the fetch process or
> is it possible the generate process is only
> outputting 3%-4% of the topN value?
>
> Thanks in advance for any pointers.
>
> JohnM
>
> --
> john mendenhall
> john@surfutopia.net
> surf utopia
> internet services
>



-- 
http://sids.in
"If you are not having fun, you are not doing it right."

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
> >>>The one task crawls about 3% of my topN and stops
> >>>eventually with java.lang.OutOfMemoryError: Java heap space
> >>>errors.
> >>Are you running Fetcher in parsing mode? Try to use the -noParsing 
> >>option, and then parse the content in a separate step.

I am now running generate/fetch/parse/updatedb.
The fetch process still only gets about 3%-4% of
the URLs in the topN of the generate.
The fetch process logs similar messages as before:

-----
fetch of http://www.example.com/public/page.asp/85491 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/16154 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/20208 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/15411 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/178293 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/843060 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/967264 failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/97401 failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/1585146 failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/11 failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
-----

The first few entries are just 'fetch of X failed with: Y'.
After a few of these, it changes to a set of 3 error messages
like 'fetcher caught: java.lang... ; java.lang... ; fetch of X
failed with: java.lang...'.

I am not seeing any errors in the parse process.

How do I hunt down the java heap space error
further?  This only occurs in the fetch process.
Do I have too many threads?

I have it set to 24 threads, 32 max on a single
host.

I have the std memory option on the java runs.
Every java process has the -Xmx1000m option.
Should this be increased?

How do you deal with slaves that have different
amounts of memory?  I have some with 1.5gb ram,
and others with 4gb ram.

Sorry for all the questions.  The fetch issue is
my current wall I am trying to overcome.

Should this be debugged in the fetch process or
is it possible the generate process is only
outputting 3%-4% of the topN value?

Thanks in advance for any pointers.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Andrzej Bialecki <ab...@getopt.org>.
John Mendenhall wrote:

>> What is the host distribution of your fetchlist? I.e. how many unique 
>> hosts do you have among all the URLs in the fetchlist? If it's just 1 
>> (or few) it could happen that they are mapped to a single map task. This 
>> is done on purpose - there is no central lock manager in Nutch / Hadoop, 
>>  and Nutch needs a way to control the rate of access to any single 
>> host, for politeness reasons. Nutch can do this only if all urls from 
>> the same host are assigned to the same map task.
> 
> All hosts are the same.  Every one of them.
> 
> If there is no way to split them up, this seems to
> imply the distributed nature of nutch is lost on
> attempting to build an index for a single large
> site.  Please correct me if I am wrong with this
> presumption.

It doesn't matter whether you use a distributed crawl or not - you still 
are expected to crawl politely, meaning that you should not exceed a
certain rate of requests / sec to any given host. Since all your urls
come from the same host, then no matter how many machines you throw at
it, you will still be crawling at a rate of 1 page / 5 seconds (or 
whatever you set in the nutch-site.xml). So, a single machine can manage 
this just fine.
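
For reference, that per-host rate is normally governed by fetcher.server.delay,
which can be overridden in conf/nutch-site.xml, roughly like this (5.0 seconds
is the value shipped in nutch-default.xml):

-----
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <!-- seconds to wait between successive requests to the same host -->
</property>
-----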


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Dennis Kubes <ku...@apache.org>.

John Mendenhall wrote:
> On Fri, 25 Jan 2008, Andrzej Bialecki wrote:
> 
>>> I am using nutch 0.9, with 1 master, 4 slaves.
>>> I am crawling a single site with 1.4 million urls.
>>>
>>> I am running the std generate/fetch/updatedb cycle
>>> with topN at 100000.
>>> It appears all 97 tasks get mapped.  Only one task
>>> sees any action.
>>> The one task crawls about 3% of my topN and stops
>>> eventually with java.lang.OutOfMemoryError: Java heap space
>>> errors.
>> Are you running Fetcher in parsing mode? Try to use the -noParsing 
>> option, and then parse the content in a separate step.
> 
> I am running fetcher in parsing mode.
> Is this possibly taking up too much memory?
> Is that most likely the problem?

Yes that is most likely the problem.

> 
> Is the recommendation to run fetcher in parsing mode?
> If so, when should the parse be done?  After the updatedb?
> Before the indexing?

You would run the parsing after the fetch process.  But this way the 
fetch would complete the download and if the parsing failed you would 
still have the page content and be able to try again without refetching.

> 
> 
>>> What settings do I need to modify to get the generated
>>> topN (100000) urls to be spread out amongst all map
>>> task slots?
>> What is the host distribution of your fetchlist? I.e. how many unique 
>> hosts do you have among all the URLs in the fetchlist? If it's just 1 
>> (or few) it could happen that they are mapped to a single map task. This 
>> is done on purpose - there is no central lock manager in Nutch / Hadoop, 
>>  and Nutch needs a way to control the rate of access to any single 
>> host, for politeness reasons. Nutch can do this only if all urls from 
>> the same host are assigned to the same map task.
> 
> All hosts are the same.  Every one of them.
> 
> If there is no way to split them up, this seems to
> imply the distributed nature of nutch is lost on
> attempting to build an index for a single large
> site.  Please correct me if I am wrong with this
> presumption.
> 
> Thanks!
> 
> JohnM
> 

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
On Fri, 25 Jan 2008, Andrzej Bialecki wrote:

> >I am using nutch 0.9, with 1 master, 4 slaves.
> >I am crawling a single site with 1.4 million urls.
> >
> >I am running the std generate/fetch/updatedb cycle
> >with topN at 100000.
> >It appears all 97 tasks get mapped.  Only one task
> >sees any action.
> >The one task crawls about 3% of my topN and stops
> >eventually with java.lang.OutOfMemoryError: Java heap space
> >errors.
> 
> Are you running Fetcher in parsing mode? Try to use the -noParsing 
> option, and then parse the content in a separate step.

I am running fetcher in parsing mode.
Is this possibly taking up too much memory?
Is that most likely the problem?

Is the recommendation to run fetcher in parsing mode?
If so, when should the parse be done?  After the updatedb?
Before the indexing?


> >What settings do I need to modify to get the generated
> >topN (100000) urls to be spread out amongst all map
> >task slots?
> 
> What is the host distribution of your fetchlist? I.e. how many unique 
> hosts do you have among all the URLs in the fetchlist? If it's just 1 
> (or few) it could happen that they are mapped to a single map task. This 
> is done on purpose - there is no central lock manager in Nutch / Hadoop, 
>  and Nutch needs a way to control the rate of access to any single 
> host, for politeness reasons. Nutch can do this only if all urls from 
> the same host are assigned to the same map task.

All hosts are the same.  Every one of them.

If there is no way to split them up, this seems to
imply the distributed nature of nutch is lost on
attempting to build an index for a single large
site.  Please correct me if I am wrong with this
presumption.

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Andrzej Bialecki <ab...@getopt.org>.
John Mendenhall wrote:
> Thank you in advance for any assistance you can
> provide, or pointers at where I should look.
> 
> I am using nutch 0.9, with 1 master, 4 slaves.
> I am crawling a single site with 1.4 million urls.
> 
> I am running the std generate/fetch/updatedb cycle
> with topN at 100000.
> It appears all 97 tasks get mapped.  Only one task
> sees any action.
> The one task crawls about 3% of my topN and stops
> eventually with java.lang.OutOfMemoryError: Java heap space
> errors.

Are you running Fetcher in parsing mode? Try to use the -noParsing 
option, and then parse the content in a separate step.


> 
> I believe I have two problems.  One is the heap space
> issue.  The other is the mapping is not spreading out
> all the urls to multiple map task slots.
> 
> What settings do I need to modify to get the generated
> topN (100000) urls to be spread out amongst all map
> task slots?

What is the host distribution of your fetchlist? I.e. how many unique
hosts do you have among all the URLs in the fetchlist? If it's just 1
(or a few) it could happen that they are all mapped to a single map task.
This is done on purpose - there is no central lock manager in Nutch / Hadoop,
and Nutch needs a way to control the rate of access to any single
host, for politeness reasons. Nutch can do this only if all urls from
the same host are assigned to the same map task.
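
A rough sketch of the idea (illustrative only, not the actual Nutch source):
the generate step partitions urls by host, so every url from the same host
lands in the same fetchlist partition and therefore the same fetch task:

-----
import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitionSketch {

    // All urls with the same host hash to the same partition (fetch task).
    static int partitionFor(String url, int numPartitions)
            throws MalformedURLException {
        String host = new URL(url).getHost().toLowerCase();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws MalformedURLException {
        // With a single host, both urls land in the same partition,
        // which is why only one of the 97 map tasks sees any work.
        System.out.println(partitionFor("http://www.example.com/public/page.asp/11", 97));
        System.out.println(partitionFor("http://www.example.com/public/page.asp/85491", 97));
    }
}
-----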

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
Thank you in advance for any assistance you can
provide, or pointers at where I should look.

I am using nutch 0.9, with 1 master, 4 slaves.
I am crawling a single site with 1.4 million urls.

I am running the std generate/fetch/updatedb cycle
with topN at 100000.
It appears all 97 tasks get mapped.  Only one task
sees any action.
The one task crawls about 3% of my topN and stops
eventually with java.lang.OutOfMemoryError: Java heap space
errors.

I believe I have two problems.  One is the heap space
issue.  The other is that the mapping is not spreading
the urls out across multiple map task slots.

What settings do I need to modify to get the generated
topN (100000) urls to be spread out amongst all map
task slots?

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
> > You would need to click on the map link on jobdetails.jsp and each task 
> > will say something like this:
> > 
> > 11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,
> 
> Okay.  Now I see it.
> 
> There are 97 map tasks.
> 
> 96 map tasks state:
> 0 threads, 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
> All the above 96 map tasks state the same thing.
> 
> 1 map task states:
> 0 threads, 2888 pages, 54 errors, 3.3 pages/s, 1681 kb/s,
> 
> There is definitely a problem here.
> 
> How do I spread out the map tasks to all available slots?
> 
> I found the errors by clicking on the task.
> Then, I clicked on the task logs all link.
> 
> It appeared to be running fine, until near the
> end, I am seeing error messages like this:
> 
> -----
> -activeThreads=16, spinWaiting=6, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/1618613 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=9, fetchQueues.totalSize=800
> -activeThreads=16, spinWaiting=0, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/118288 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/74343 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=2, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/971779 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=2, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/1585170 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=1, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/82747 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/687356 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=0, fetchQueues.totalSize=798
> fetch of http://www.example.com/public/page.asp/1425403 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/1461659 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=3, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/1484234 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/1096086 failed with: java.lang.OutOfMemoryError: Java heap space
> -activeThreads=16, spinWaiting=5, fetchQueues.totalSize=800
> fetch of http://www.example.com/public/page.asp/12074 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/565789 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/542157 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/522730 failed with: java.lang.OutOfMemoryError: Java heap space
> fetch of http://www.example.com/public/page.asp/1617438 failed with: java.lang.OutOfMemoryError: Java heap space
> -----
> 
> How do I resolve this error?
> The master has 4gb, this particular slave has 1.5gb.
> Another slave has 1gb, another slave has 4gb.
> Do I just need to add physical memory?
> Or, is this something else in the configuration?
> 
> Is this error the cause of only doing 3% of the 100k
> urls I requested to be done?
> 
> Or, is it a problem with the other 96 map tasks not doing
> anything?
> 
> Thanks again for all of your help.
> 
> JohnM

Does anyone have any thoughts on how I can begin
addressing the issues I am experiencing above?

Thanks in advance for any pointers anyone can
provide.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
> You would need to click on the map link on jobdetails.jsp and each task 
> will say something like this:
> 
> 11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,

Okay.  Now I see it.

There are 97 map tasks.

96 map tasks state:
0 threads, 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
All the above 96 map tasks state the same thing.

1 map task states:
0 threads, 2888 pages, 54 errors, 3.3 pages/s, 1681 kb/s,

There is definitely a problem here.

How do I spread out the map tasks to all available slots?

I found the errors by clicking on the task.
Then, I clicked on the task logs all link.

It appeared to be running fine until, near the
end, I started seeing error messages like this:

-----
-activeThreads=16, spinWaiting=6, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/1618613 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=9, fetchQueues.totalSize=800
-activeThreads=16, spinWaiting=0, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/118288 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/74343 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=2, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/971779 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=2, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/1585170 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=1, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/82747 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/687356 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=0, fetchQueues.totalSize=798
fetch of http://www.example.com/public/page.asp/1425403 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/1461659 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=3, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/1484234 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/1096086 failed with: java.lang.OutOfMemoryError: Java heap space
-activeThreads=16, spinWaiting=5, fetchQueues.totalSize=800
fetch of http://www.example.com/public/page.asp/12074 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/565789 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/542157 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/522730 failed with: java.lang.OutOfMemoryError: Java heap space
fetch of http://www.example.com/public/page.asp/1617438 failed with: java.lang.OutOfMemoryError: Java heap space
-----

How do I resolve this error?
The master has 4gb, this particular slave has 1.5gb.
Another slave has 1gb, another slave has 4gb.
Do I just need to add physical memory?
Or, is this something else in the configuration?

Is this error the reason only 3% of the 100k
urls I requested are being fetched?

Or, is it a problem with the other 96 map tasks not doing
anything?

Thanks again for all of your help.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Dennis Kubes <ku...@apache.org>.
You would need to click on the map link on jobdetails.jsp and each task 
will say something like this:

11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,

Dennis Kubes

John Mendenhall wrote:
>>>> Three, you could be maxing out your bandwidth and only 1/10th of urls 
>>>> are actually getting through before timeout or the site is blocking most 
>>>> of the urls you are trying to fetch through robots.txt.  Look at the 
>>>> JobTracker admin screen for the fetch job and see how many errors are in 
>>>> each fetch task.
>>> We work with the site, and robots.txt is allowing us
>>> through.  It is definitely getting different pages
>>> each time.  We have 100000 urls in the crawldb.
>>> It is only getting about 3% new pages each generate-
>>> fetch-update cycle.
>>>
>>> The most recent completed run had 97 map tasks and
>>> 17 reduce tasks, all completed fine, with 0 failures.
>> Check the number of errors in the fetcher tasks themselves.  I 
>> understand the task will complete but the fetcher screen should show 
>> number of fetching errors.  My guess is that this is high.
> 
> I am going to the jobtracker url, at default port 50030.
> I find the most recent fetch task, which is listed at
> 
>   fetch /var/nutch/crawl/segments/20080121075010
> 
> I click on the job link (job_0183).
> It sends me the jobdetails.jsp page, which is what I
> reported on.
> 
> It seems to me you are referring to another interface.
> Can you please let me know where I should be looking
> for the errors in the fetcher tasks themselves?
> 
> Thanks!
> 
> JohnM
> 

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
> >>Three, you could be maxing out your bandwidth and only 1/10th of urls 
> >>are actually getting through before timeout or the site is blocking most 
> >>of the urls you are trying to fetch through robots.txt.  Look at the 
> >>JobTracker admin screen for the fetch job and see how many errors are in 
> >>each fetch task.
> >
> >We work with the site, and robots.txt is allowing us
> >through.  It is definitely getting different pages
> >each time.  We have 100000 urls in the crawldb.
> >It is only getting about 3% new pages each generate-
> >fetch-update cycle.
> >
> >The most recent completed run had 97 map tasks and
> >17 reduce tasks, all completed fine, with 0 failures.
> 
> Check the number of errors in the fetcher tasks themselves.  I 
> understand the task will complete but the fetcher screen should show 
> number of fetching errors.  My guess is that this is high.

I am going to the jobtracker url, at default port 50030.
I find the most recent fetch task, which is listed at

  fetch /var/nutch/crawl/segments/20080121075010

I click on the job link (job_0183).
It sends me the jobdetails.jsp page, which is what I
reported on.

It seems to me you are referring to another interface.
Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Dennis Kubes <ku...@apache.org>.

John Mendenhall wrote:
> On Sat, 19 Jan 2008, Dennis Kubes wrote:
>> There are a few different things that could be causing this.
> 
> Thanks for the response!
> 
>> One, there is a variable called generate.max.per.host in the 
>> nutch-default.xml file.  If this is set to a value instead of -1 then it 
>> will limit the number of urls from that host.
> 
> Variable generate.max.per.host is set to -1.
> 
>> Two, have you set the http.agent.name? If you didn't it probably 
>> wouldn't have fetched anything at all.  The job would complete but the 
>> output would be 0.
> 
> Variable http.agent.name is set.  Nutch definitely
> fetches documents.  No problem there.
> 
>> Three, you could be maxing out your bandwidth and only 1/10th of urls 
>> are actually getting through before timeout or the site is blocking most 
>> of the urls you are trying to fetch through robots.txt.  Look at the 
>> JobTracker admin screen for the fetch job and see how many errors are in 
>> each fetch task.
> 
> We work with the site, and robots.txt is allowing us
> through.  It is definitely getting different pages
> each time.  We have 100000 urls in the crawldb.
> It is only getting about 3% new pages each generate-
> fetch-update cycle.
> 
> The most recent completed run had 97 map tasks and
> 17 reduce tasks, all completed fine, with 0 failures.

Check the number of errors in the fetcher tasks themselves.  I 
understand the task will complete but the fetcher screen should show
the number of fetching errors.  My guess is that this is high.

Dennis
> 
>> It could also be a url-filter problem with a bad regex filter.
> 
> I doubt this is a problem.  Each cycle run allows new
> urls in.  It just seems limited for each run.
> 
>> My guess would be from the info you have given that you are maxing your 
>> bandwidth.  This would cause the number fetched to fluctuate some but be 
>> about the same.  What is your bandwidth for fetching and what do you 
>> have mapred.map.tasks set to and fetcher.threads.fetch set to?
> 
> I will have to check on the bandwidth available
> for fetching.
> 
> Variable mapred.map.tasks is set to 97.
> Variable mapred.reduce.tasks is set to 17.
> 
> Variable fetcher.threads.fetch is set to 10.
> 
> Thanks again for any pointers you can provide.
> 
> JohnM
> 
> 
> 
> 
>> John Mendenhall wrote:
>>> Hello,
>>>
>>> I am running nutch 0.9 currently.
>>> I am running on 4 nodes, one is the
>>> master, in addition to being a slave.
>>>
>>> I have injected 100k urls into nutch.
>>> All urls are on the same host.
>>>
>>> I am running a generate/fetch/update
>>> cycle with topN set at 100k.
>>>
>>> However, after each cycle, it only
>>> fetches between 2588 and 2914 urls
>>> each time.  I have run this over 8
>>> times, all with the same result.
>>>
>>> I have tried using nutch fetch and
>>> nutch fetch2.
>>>
>>> My hypothesis is, this is due to all
>>> urls being on same host (www.example.com/some/path).
>>>
>>> Do I need to set the fetcher.threads.per.host
>>> to something higher than the default of 2?
>> The fetcher.threads.per.host variable is just the number of threads 
>> (fetchers) that can fetch a single host at a given time.  If you own/run 
>> the domain it is okay to crawl it faster, if not the default politeness 
>> settings are best as to not overwhelm the server you are crawling.
>>
>>> Is there something in the logs I should
>>> look for to determine the exact cause of
>>> this problem?
>>>
>>> Thank you in advance for any assistance
>>> that can be provided.
>>>
>>> If you need any additional information,
>>> please let me know and I'll send it.
>>>
>>> Thanks!
>>>
>>> JohnM
> 

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by John Mendenhall <jo...@surfutopia.net>.
On Sat, 19 Jan 2008, Dennis Kubes wrote:
> There are a few different things that could be causing this.

Thanks for the response!

> One, there is a variable called generate.max.per.host in the 
> nutch-default.xml file.  If this is set to a value instead of -1 then it 
> will limit the number of urls from that host.

Variable generate.max.per.host is set to -1.

> Two, have you set the http.agent.name? If you didn't it probably 
> wouldn't have fetched anything at all.  The job would complete but the 
> output would be 0.

Variable http.agent.name is set.  Nutch definitely
fetches documents.  No problem there.

> Three, you could be maxing out your bandwidth and only 1/10th of urls 
> are actually getting through before timeout or the site is blocking most 
> of the urls you are trying to fetch through robots.txt.  Look at the 
> JobTracker admin screen for the fetch job and see how many errors are in 
> each fetch task.

We work with the site, and robots.txt is allowing us
through.  It is definitely getting different pages
each time.  We have 100000 urls in the crawldb.
It is only getting about 3% new pages each generate-
fetch-update cycle.

The most recent completed run had 97 map tasks and
17 reduce tasks, all completed fine, with 0 failures.

> It could also be a url-filter problem with a bad regex filter.

I doubt this is a problem.  Each cycle run allows new
urls in.  It just seems limited for each run.

> My guess would be from the info you have given that you are maxing your 
> bandwidth.  This would cause the number fetched to fluctuate some but be 
> about the same.  What is your bandwidth for fetching and what do you 
> have mapred.map.tasks set to and fetcher.threads.fetch set to?

I will have to check on the bandwidth available
for fetching.

Variable mapred.map.tasks is set to 97.
Variable mapred.reduce.tasks is set to 17.

Variable fetcher.threads.fetch is set to 10.

Thanks again for any pointers you can provide.

JohnM




> John Mendenhall wrote:
> >Hello,
> >
> >I am running nutch 0.9 currently.
> >I am running on 4 nodes, one is the
> >master, in addition to being a slave.
> >
> >I have injected 100k urls into nutch.
> >All urls are on the same host.
> >
> >I am running a generate/fetch/update
> >cycle with topN set at 100k.
> >
> >However, after each cycle, it only
> >fetches between 2588 and 2914 urls
> >each time.  I have run this over 8
> >times, all with the same result.
> >
> >I have tried using nutch fetch and
> >nutch fetch2.
> >
> >My hypothesis is, this is due to all
> >urls being on same host (www.example.com/some/path).
> >
> >Do I need to set the fetcher.threads.per.host
> >to something higher than the default of 2?
> 
> The fetcher.threads.per.host variable is just the number of threads 
> (fetchers) that can fetch a single host at a given time.  If you own/run 
> the domain it is okay to crawl it faster, if not the default politeness 
> settings are best as to not overwhelm the server you are crawling.
> 
> >
> >Is there something in the logs I should
> >look for to determine the exact cause of
> >this problem?
> >
> >Thank you in advance for any assistance
> >that can be provided.
> >
> >If you need any additional information,
> >please let me know and I'll send it.
> >
> >Thanks!
> >
> >JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

Posted by Dennis Kubes <ku...@apache.org>.
There are a few different things that could be causing this.

One, there is a variable called generate.max.per.host in the 
nutch-default.xml file.  If this is set to a value instead of -1 then it 
will limit the number of urls from that host.

Two, have you set the http.agent.name? If you didn't it probably 
wouldn't have fetched anything at all.  The job would complete but the 
output would be 0.

Three, you could be maxing out your bandwidth and only 1/10th of urls
are actually getting through before timeout, or the site is blocking most
of the urls you are trying to fetch through robots.txt.  Look at the 
JobTracker admin screen for the fetch job and see how many errors are in 
each fetch task.

It could also be a url-filter problem with a bad regex filter.

My guess, from the info you have given, is that you are maxing out your
bandwidth.  This would cause the number fetched to fluctuate some but stay
about the same.  What is your bandwidth for fetching, and what do you
have mapred.map.tasks and fetcher.threads.fetch set to?
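
For reference, the Nutch settings mentioned above are defined in
nutch-default.xml and are usually overridden in conf/nutch-site.xml,
along these lines (values shown are only illustrative):

-----
<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <!-- -1 means no per-host cap on generated urls -->
</property>

<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
  <!-- illustrative name; must be set or nothing is fetched -->
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <!-- total fetcher threads per fetch task -->
</property>
-----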

Dennis Kubes

John Mendenhall wrote:
> Hello,
> 
> I am running nutch 0.9 currently.
> I am running on 4 nodes, one is the
> master, in addition to being a slave.
> 
> I have injected 100k urls into nutch.
> All urls are on the same host.
> 
> I am running a generate/fetch/update
> cycle with topN set at 100k.
> 
> However, after each cycle, it only
> fetches between 2588 and 2914 urls
> each time.  I have run this over 8
> times, all with the same result.
> 
> I have tried using nutch fetch and
> nutch fetch2.
> 
> My hypothesis is, this is due to all
> urls being on same host (www.example.com/some/path).
> 
> Do I need to set the fetcher.threads.per.host
> to something higher than the default of 2?

The fetcher.threads.per.host variable is just the number of threads
(fetchers) that can fetch a single host at a given time.  If you own/run
the domain it is okay to crawl it faster; if not, the default politeness
settings are best so as not to overwhelm the server you are crawling.
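
If you do own or have permission to crawl the host harder, a sketch of the
override in conf/nutch-site.xml (the value is illustrative):

-----
<property>
  <name>fetcher.threads.per.host</name>
  <value>8</value>
  <!-- max threads allowed to fetch from one host at a time -->
</property>
-----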

> 
> Is there something in the logs I should
> look for to determine the exact cause of
> this problem?
> 
> Thank you in advance for any assistance
> that can be provided.
> 
> If you need any additional information,
> please let me know and I'll send it.
> 
> Thanks!
> 
> JohnM
>