Posted to user@nutch.apache.org by "Ordway, Ryan" <Ry...@oregonstate.edu> on 2005/09/20 22:52:51 UTC

NDFS java.io.IOException

	I'm doing some experimenting with NDFS to see if it will work
for my nutch cluster. It seems to do fine until my data nodes start
to run out of memory. This is a pilot project, so I'm using some
older systems -- P4/1.9GHz/512MB/20GB. Right now they're booting off
a ramdisk image, but I'm going to be moving to a disk-based system
to see if that helps things.

	Anyhow, my database node was able to generate the database and
import about 4 million URLs just fine. It generated segments from the
database just fine. I generated it for 4 fetchers and ran a fetcher
on each of my compute nodes.

	To clarify, I've got one master node that maintains the database
and acts as the NDFS name node. I've got four compute nodes that perform
fetches and act as data nodes.

	Things went fine for a good 36 hours or so, then I noticed that
the systems were starting to swap and their performance started to tank.
After a while, each node's fetch process started to die off with
errors like:

SEVERE error writing output:java.io.IOException: Could not obtain new
output block for file /user/root/segments/20050919121742-2/content/data

	... stack trace ...

SEVERE error writing output:java.io.IOException: key out of order:
312670 after 312670

	... stack trace ...

... A few more of these errors ...

Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged. Exiting fetcher.

	And then the fetch dies.

	Is there anything that can be done to prevent this, short of
adding more RAM to these systems?

	Thanks,

	Ryan

--
Ryan Ordway
Unix Systems Administrator
Oregon State University Libraries
121 The Valley Library
Corvallis, OR 97331

E-mail: ryan.ordway@oregonstate.edu
        rordway@library.oregonstate.edu
Desk:   541.737.8972
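
The swapping described in this message can be watched for before the fetchers start failing. What follows is only a minimal sketch, assuming the data nodes run Linux and expose /proc/meminfo (the post does not say which OS is in use). The function names swap_used_kb and warn_if_swapping and the 100 MB default threshold are illustrative and are not part of Nutch or NDFS.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: print swap in use (kB) by reading
# a meminfo-style file. ASSUMPTION: Linux /proc/meminfo format, where
# SwapTotal and SwapFree are reported in kB.
swap_used_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }' "$meminfo"
}

# Hypothetical check: return non-zero once more than limit_kb is swapped out.
# The default limit (102400 kB, i.e. 100 MB) is an arbitrary illustration.
warn_if_swapping() {
    limit_kb="${2:-102400}"
    used=$(swap_used_kb "$1")
    if [ "$used" -gt "$limit_kb" ]; then
        echo "WARNING: ${used} kB of swap in use"
        return 1
    fi
    echo "OK: ${used} kB of swap in use"
}
```

Run periodically on each data node (for example from cron) as `warn_if_swapping /proc/meminfo`, this would flag a node that is starting to swap, so a fetch can be stopped cleanly rather than dying partway through a segment.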


Re: Maintaining only one FAQ

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
My comments:

Are there any mailing lists available?

It would be easier to read if the answer were listed in a table here.

Also combine the questions "Are there any mailing lists available?" and
"Is there a mail archive?"

Like:

Listname | Subscribe | Unsubscribe | Online Archive | Download Archive


Gal Nitzan wrote:
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  to 
> contain the questions from the FAQ on Nutch's home page 
> http://lucene.apache.org/nutch/faq.html
> 
> I propose to replace the current home page FAQ with the one in the wiki. 
> I believe there should be only one FAQ and it is easier to maintain.
Yes, much better!

Re: Maintaining only one FAQ

Posted by Jérôme Charron <je...@gmail.com>.
+1

On 9/28/05, Stefan Groschupf <sg...@media-style.com> wrote:
>
> 1+
>
> On 28.09.2005 at 12:12, Gal Nitzan wrote:
>
> > Hi,
> >
> > I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> > to contain the questions from the FAQ on Nutch's home page
> > http://lucene.apache.org/nutch/faq.html
> >
> > I propose to replace the current home page FAQ with the one in the
> > wiki. I believe there should be only one FAQ and it is easier to
> > maintain.
> >
> > Regards,
> >
> > Gal
> >
> >
>
>


--
http://motrech.free.fr/
http://www.frutch.org/

Re: Maintaining only one FAQ

Posted by Stefan Groschupf <sg...@media-style.com>.
1+

On 28.09.2005 at 12:12, Gal Nitzan wrote:

> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> to contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the  
> wiki. I believe there should be only one FAQ and it is easier to  
> maintain.
>
> Regards,
>
> Gal
>
>


Re: Maintaining only one FAQ - I can not do it only webmaster

Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> +1
>
> Gal Nitzan wrote:
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  
>> to contain the questions from the FAQ on Nutch's home page 
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the 
>> wiki. I believe there should be only one FAQ and it is easier to 
>> maintain.
>>
>> Regards,
>>
>> Gal
>
> .
>


Re: Doug - FAQ - Re: Maintaining only one FAQ

Posted by Piotr Kosiorowski <pk...@gmail.com>.
I will redeploy the site to point to the wiki. I am in the process of the
0.7.1 release, but it is taking much longer than I expected because of
lack of time. I will make this change during release preparation; I hope
to manage it today or over the weekend.
Regards
Piotr


Gal Nitzan wrote:
> Jon Shoberg wrote:
> 
>> Doug Cutting wrote:
>>
>>> +1
>>>
>>> Gal Nitzan wrote:
>>>
>>>> Hi,
>>>>
>>>> I have enhanced the FAQ 
>>>> http://wiki.apache.org/nutch/FAQ?action=show  to contain the 
>>>> questions from the FAQ on Nutch's home page 
>>>> http://lucene.apache.org/nutch/faq.html
>>>>
>>>> I propose to replace the current home page FAQ with the one in the 
>>>> wiki. I believe there should be only one FAQ and it is easier to 
>>>> maintain.
>>>>
>>>> Regards,
>>>>
>>>> Gal
>>
>>
>> Doug,
>>
>>   I would be glad to assist with the FAQ.  Is there a way to start?
>>
>>   BTW .. We're using Nutch for search on our site :)
>>
>>   http://fisher.osu.edu
>>
>> Best,
>>
>> Jon Shoberg
>> Systems Developer
>> Fisher College of Business
>>
>>
>>
>> .
>>
> Hi Doug, yes thank you.
> 
> I do not have access to the main Nutch site on apache.org
> 
> The link on our home page currently points to: 
> http://lucene.apache.org/nutch/faq.html
> 
> It should point to: http://wiki.apache.org/nutch/FAQ
> 
> Regards,
> 
> Gal
> 


Re: Doug - FAQ - Re: Maintaining only one FAQ

Posted by Gal Nitzan <gn...@usa.net>.
Jon Shoberg wrote:
> Doug Cutting wrote:
>> +1
>>
>> Gal Nitzan wrote:
>>
>>> Hi,
>>>
>>> I have enhanced the FAQ 
>>> http://wiki.apache.org/nutch/FAQ?action=show  to contain the 
>>> questions from the FAQ on Nutch's home page 
>>> http://lucene.apache.org/nutch/faq.html
>>>
>>> I propose to replace the current home page FAQ with the one in the 
>>> wiki. I believe there should be only one FAQ and it is easier to 
>>> maintain.
>>>
>>> Regards,
>>>
>>> Gal
>
> Doug,
>
>   I would be glad to assist with the FAQ.  Is there a way to start?
>
>   BTW .. We're using Nutch for search on our site :)
>
>   http://fisher.osu.edu
>
> Best,
>
> Jon Shoberg
> Systems Developer
> Fisher College of Business
>
>
>
> .
>
Hi Doug, yes thank you.

I do not have access to the main Nutch site on apache.org

The link on our home page currently points to: 
http://lucene.apache.org/nutch/faq.html

It should point to: http://wiki.apache.org/nutch/FAQ

Regards,

Gal

Doug - FAQ - Re: Maintaining only one FAQ

Posted by Jon Shoberg <jo...@shoberg.net>.
Doug Cutting wrote:
> +1
> 
> Gal Nitzan wrote:
> 
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  
>> to contain the questions from the FAQ on Nutch's home page 
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the 
>> wiki. I believe there should be only one FAQ and it is easier to 
>> maintain.
>>
>> Regards,
>>
>> Gal

Doug,

   I would be glad to assist with the FAQ.  Is there a way to start?

   BTW .. We're using Nutch for search on our site :)

   http://fisher.osu.edu

Best,

Jon Shoberg
Systems Developer
Fisher College of Business



Re: Maintaining only one FAQ

Posted by Doug Cutting <cu...@nutch.org>.
+1

Gal Nitzan wrote:
> Hi,
> 
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  to 
> contain the questions from the FAQ on Nutch's home page 
> http://lucene.apache.org/nutch/faq.html
> 
> I propose to replace the current home page FAQ with the one in the wiki. 
> I believe there should be only one FAQ and it is easier to maintain.
> 
> Regards,
> 
> Gal

Re: [Nutch-general] Maintaining only one FAQ

Posted by og...@yahoo.com.
+1
And +1 for making it look like the existing Lucene FAQ.

Otis

--- Gal Nitzan <gn...@usa.net> wrote:

> Hi,
> 
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> to contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
> 
> I propose to replace the current home page FAQ with the one in the
> wiki. I believe there should be only one FAQ and it is easier to
> maintain.
> 
> Regards,
> 
> Gal


Maintaining only one FAQ

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  to 
contain the questions from the FAQ on Nutch's home page 
http://lucene.apache.org/nutch/faq.html

I propose to replace the current home page FAQ with the one in the wiki. 
I believe there should be only one FAQ and it is easier to maintain.

Regards,

Gal

HTTP ERROR: 500

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

I connected to Jetty on port 7845.

When I click jobdetails.jsp, I get:


    HTTP ERROR: 500

Internal Server Error

RequestURI=/jobdetails.jsp

Powered by Jetty: <http://jetty.mortbay.org>

The job tracker works perfectly.

Regards,

Gal
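
A failure like this is easier to report if the status of each page is captured from the command line. The following is a minimal sketch only, assuming curl is installed. masterHost, port 7845 and jobdetails.jsp come from this thread, while the function names describe_status and check_page are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for probing pages on the web interface.

# Map an HTTP status code to a coarse category.
describe_status() {
    case "$1" in
        2??) echo "ok" ;;
        3??) echo "redirect" ;;
        4??) echo "client error" ;;
        5??) echo "server error" ;;
        *)   echo "no response" ;;
    esac
}

# Fetch one page and print its status. -s silences progress output,
# -o /dev/null discards the body, and -w '%{http_code}' prints only the
# numeric status (curl prints 000 when no HTTP response was received).
check_page() {
    base="$1"; page="$2"
    code=$(curl -s -o /dev/null -w '%{http_code}' "$base/$page")
    echo "$page: $code ($(describe_status "$code"))"
}
```

For example, `check_page http://masterHost:7845 jobdetails.jsp` would print the page name with its status and category. For a 500, the exception behind it is normally recorded on the server side, so the servlet container's log output is the next place to look.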

Re: NDFS java.io.IOException

Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved.  The 
> crawling code in that branch has also been re-written to be 
> MapReduce-based, and will automatically manage multi-machine fetching, 
> db updates, indexing, etc.
>
> There's not yet much documentation for this version however.  Probably 
> the best documentation is in this pdf, and it is spartan:
>
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf 
>
>
> Here's a quick cheat sheet:
>
> svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
> cd mapred
> ant
>
> emacs conf/nutch-site.xml
> # define fs.default.name to be masterHost:XXXX
> # define mapred.job.tracker to be masterHost:YYYY
>
> emacs conf/mapred-default.xml
> # define mapred.map.tasks to be a multiple of the number of slave hosts
> # define mapred.reduce.tasks to be the number of slave hosts
>
> # make a file with slave host names
> echo slave1 >> ~/.slaves
> echo slave2 >> ~/.slaves
> echo slave3 >> ~/.slaves
>
> # start all ndfs & mapred daemons
> bin/start-all.sh
>
> # make a directory with seed list file
> mkdir seeds
> echo http://lucene.apache.org/nutch/ > seeds/urls
>
> # put seed directory in ndfs
> bin/nutch ndfs -put seeds seeds
>
> # crawl a bit
> bin/nutch crawl seeds -depth 3
>
> # monitor things from the administrative interface
> firefox masterHost:7845
>
> If you try this, please tell us how it goes.
>
> Doug
>
> .
>

Hi,

This cheat sheet worked perfectly !!! first time !!!

And all I can say is wow. Looks great.

Gal.

Re: NDFS java.io.IOException

Posted by Rod Taylor <rb...@sitesell.com>.
> > Secondly, will it still be possible to get the output dumped (ie.
> > segread -dump) to a flat file in large chunks?
> 
> In principle, yes, but I have not tested the segread code in the mapred 
> branch, and it may need to be updated, as the structure of segments has 
> changed a bit.

I'm not a Java programmer nor do I really understand what is going on,
but I took a crack at reimplementing the most basic version of the
segread code (full output with -dump to stdout).

It appears to function correctly with a single Nutch backend. I am sure
it is not correct to send data to STDOUT from the reduce() function,
but I'm not sure what other location is more appropriate.

I am hoping that this will encourage someone to either finish it off or
tell me about the logic issues.

The attached SegmentReader.java goes into org.apache.nutch.crawl and you
may need to fiddle with the bin/nutch shell script to use it.

-- 
Rod Taylor <rb...@sitesell.com>

Re: NDFS java.io.IOException

Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> I haven't looked at it and wasn't concerned until I saw the
> "automatically" but will we still be able to crawl and not index?

Yes, but the sequence of commands has changed slightly.  Look at Crawl.java.

> Secondly, will it still be possible to get the output dumped (ie.
> segread -dump) to a flat file in large chunks?

In principle, yes, but I have not tested the segread code in the mapred 
branch, and it may need to be updated, as the structure of segments has 
changed a bit.

Doug

Re: NDFS java.io.IOException

Posted by Rod Taylor <rb...@sitesell.com>.
On Tue, 2005-09-20 at 19:07 -0700, Doug Cutting wrote:
> What version of Nutch are you using?
> 
> The version of NDFS in the mapred branch is much improved.  The crawling 
> code in that branch has also been re-written to be MapReduce-based, and 
> will automatically manage multi-machine fetching, db updates, indexing, etc.

I haven't looked at it and wasn't concerned until I saw the
"automatically" but will we still be able to crawl and not index?

Secondly, will it still be possible to get the output dumped (ie.
segread -dump) to a flat file in large chunks?


We use Nutch as a crawler only, then after taking a dump of the data we
remove the segment from the filesystem. This means we only have a couple
hundred GB of data around at any given time.

We do our crawling, db updates, etc. in one environment then
post-process the HTML retrieved in large chunks (segments of about 200k
pages) within another environment. 

-- 
Rod Taylor <rb...@sitesell.com>


Re: NDFS java.io.IOException

Posted by Doug Cutting <cu...@nutch.org>.
What version of Nutch are you using?

The version of NDFS in the mapred branch is much improved.  The crawling 
code in that branch has also been re-written to be MapReduce-based, and 
will automatically manage multi-machine fetching, db updates, indexing, etc.

There's not yet much documentation for this version however.  Probably 
the best documentation is in this pdf, and it is spartan:

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf

Here's a quick cheat sheet:

svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
cd mapred
ant

emacs conf/nutch-site.xml
# define fs.default.name to be masterHost:XXXX
# define mapred.job.tracker to be masterHost:YYYY

emacs conf/mapred-default.xml
# define mapred.map.tasks to be a multiple of the number of slave hosts
# define mapred.reduce.tasks to be the number of slave hosts

# make a file with slave host names
echo slave1 >> ~/.slaves
echo slave2 >> ~/.slaves
echo slave3 >> ~/.slaves

# start all ndfs & mapred daemons
bin/start-all.sh

# make a directory with seed list file
mkdir seeds
echo http://lucene.apache.org/nutch/ > seeds/urls

# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds

# crawl a bit
bin/nutch crawl seeds -depth 3

# monitor things from the administrative interface
firefox masterHost:7845

If you try this, please tell us how it goes.

Doug
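
The "emacs conf/nutch-site.xml" step in the cheat sheet can also be scripted. The following is a minimal sketch under stated assumptions, not the project's own tooling. The function name write_site_conf is hypothetical, and the <nutch-conf> root element is assumed from Nutch 0.7-era configuration files, so it should be checked against conf/nutch-default.xml in the checkout. The two ports are left as arguments because the cheat sheet does not specify them.

```shell
#!/bin/sh
# Sketch: write a nutch-site.xml that sets the two properties named in the
# cheat sheet (fs.default.name and mapred.job.tracker).
# ASSUMPTION: the <nutch-conf> root element is taken from 0.7-era config
# files; verify it against conf/nutch-default.xml before relying on this.
# Arguments: config directory, master host name, file system port,
# job tracker port. The ports are whatever free ports the reader picks.
write_site_conf() {
    conf_dir="$1"; master="$2"; fs_port="$3"; jt_port="$4"
    mkdir -p "$conf_dir"
    cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <value>${master}:${fs_port}</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${master}:${jt_port}</value>
  </property>
</nutch-conf>
EOF
}
```

Called as `write_site_conf conf masterHost "$FS_PORT" "$JT_PORT"` from the top of the checkout, with FS_PORT and JT_PORT set to the two chosen ports, this would overwrite conf/nutch-site.xml, so any existing site settings would need to be merged in by hand.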