Posted to user@nutch.apache.org by "Ordway, Ryan" <Ry...@oregonstate.edu> on 2005/09/20 22:52:51 UTC
NDFS java.io.IOException
I'm experimenting with NDFS to see if it will work
for my Nutch cluster. It seems to do fine until my data nodes start
to run out of memory. This is a pilot project, so I'm using some
older systems -- P4/1.9GHz/512MB/20GB. Right now they're booting off
a ramdisk image, but I'm going to move to a disk-based system
to see if that helps.
Anyhow, my database node was able to generate the database and
import about 4 million URLs just fine. It generated segments from the
database just fine. I generated them for 4 fetchers and ran a fetcher
on each of my compute nodes.
To clarify, I've got one master node that maintains the database
and acts as the NDFS name node. I've got four compute nodes that perform
fetches and act as data nodes.
Things went fine for a good 36 hours or so, then I noticed that
the systems were starting to swap and their performance started to tank.
After a while, each node's fetch process started to die off with
errors like:
SEVERE error writing output:java.io.IOException: Could not obtain new output block for file /user/root/segments/20050919121742-2/content/data
... stack trace ...
SEVERE error writing output:java.io.IOException: key out of order: 312670 after 312670
... stack trace ...
... A few more of these errors ...
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
And then the fetch dies.
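The second message reports key 312670 arriving after key 312670, which suggests that the writer expects each key to be strictly greater than the one before it. Here is a minimal shell sketch of that kind of check, assuming one numeric key per line on standard input. The function name `check_key_order` is hypothetical, and this is an illustration of the condition, not the Nutch code:

```shell
# Hypothetical illustration only; this is not how Nutch implements the check.
# Reads one numeric key per line from stdin and fails on the first key that
# is not strictly greater than the previous one.
check_key_order() {
    awk '
        NR > 1 && ($1 + 0) <= (prev + 0) {
            printf "key out of order: %s after %s\n", $1, prev
            exit 1
        }
        { prev = $1 }
    '
}
```

For example, `printf '312669\n312670\n312670\n' | check_key_order` reports the repeated key and returns a non-zero status, while a strictly increasing sequence passes silently.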
Is there anything that can be done to prevent this, short of adding
more RAM to these systems?
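To confirm that the data nodes really are swapping before the fetchers fail, swap use could be sampled on each node. Here is a minimal sketch, assuming Linux nodes that expose `/proc/meminfo` with `SwapTotal` and `SwapFree` lines. The function name `swap_used_kb` is hypothetical:

```shell
# Prints the amount of swap in use, in kB, from a meminfo-format file.
# Defaults to /proc/meminfo; a different path can be passed for testing.
swap_used_kb() {
    awk '
        /^SwapTotal:/ { total = $2 }
        /^SwapFree:/  { free = $2 }
        END           { print total - free }
    ' "${1:-/proc/meminfo}"
}
```

Running something like `while sleep 60; do echo "$(date) $(swap_used_kb)"; done` on a data node would log one sample per minute, which should show whether swap use climbs steadily during the fetch.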
Thanks,
Ryan
--
Ryan Ordway
Unix Systems Administrator
Oregon State University Libraries
121 The Valley Library
Corvallis, OR 97331
E-mail: ryan.ordway@oregonstate.edu
        rordway@library.oregonstate.edu
Desk: 541.737.8972
Re: Maintaining only one FAQ
Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
My comments:
Are there any mailing lists available?
It would be easier if the answer were listed in a table here.
Also combine the questions "Are there any mailing lists available?" and
"Is there a mail archive?"
Like:
Listname | Subscribe | Unsubscribe | Online Archive | Download Archive
Gal Nitzan wrote:
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show to
> contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the wiki.
> I believe there should be only one FAQ and it is easier to maintain.
Yes, much better!
Re: Maintaining only one FAQ
Posted by Jérôme Charron <je...@gmail.com>.
+1
On 9/28/05, Stefan Groschupf <sg...@media-style.com> wrote:
>
> 1+
>
> On 28.09.2005 at 12:12, Gal Nitzan wrote:
>
> > Hi,
> >
> > I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> > to contain the questions from the FAQ on Nutch's home page
> > http://lucene.apache.org/nutch/faq.html
> >
> > I propose to replace the current home page FAQ with the one in the
> > wiki. I believe there should be only one FAQ and it is easier to
> > maintain.
> >
> > Regards,
> >
> > Gal
> >
> >
>
>
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Maintaining only one FAQ
Posted by Stefan Groschupf <sg...@media-style.com>.
1+
On 28.09.2005 at 12:12, Gal Nitzan wrote:
> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> to contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the
> wiki. I believe there should be only one FAQ and it is easier to
> maintain.
>
> Regards,
>
> Gal
>
>
Re: Maintaining only one FAQ - I can not do it only webmaster
Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> +1
>
> Gal Nitzan wrote:
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
>> to contain the questions from the FAQ on Nutch's home page
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the
>> wiki. I believe there should be only one FAQ and it is easier to
>> maintain.
>>
>> Regards,
>>
>> Gal
Re: Doug - FAQ - Re: Maintaining only one FAQ
Posted by Piotr Kosiorowski <pk...@gmail.com>.
I will redeploy the site to point to the wiki. I am in the process of the
0.7.1 release, but it is taking much longer than I expected because of lack
of time. I will make this change during release preparation; I hope to
manage it today or over the weekend.
Regards
Piotr
Gal Nitzan wrote:
> Jon Shoberg wrote:
>
>> Doug Cutting wrote:
>>
>>> +1
>>>
>>> Gal Nitzan wrote:
>>>
>>>> Hi,
>>>>
>>>> I have enhanced the FAQ
>>>> http://wiki.apache.org/nutch/FAQ?action=show to contain the
>>>> questions from the FAQ on Nutch's home page
>>>> http://lucene.apache.org/nutch/faq.html
>>>>
>>>> I propose to replace the current home page FAQ with the one in the
>>>> wiki. I believe there should be only one FAQ and it is easier to
>>>> maintain.
>>>>
>>>> Regards,
>>>>
>>>> Gal
>>
>>
>> Doug,
>>
>> I would be glad to assist with the FAQ. Is there a way to start?
>>
>> BTW .. We're using Nutch for search on our site :)
>>
>> http://fisher.osu.edu
>>
>> Best,
>>
>> Jon Shoberg
>> Systems Developer
>> Fisher College of Business
>>
>>
>>
>> .
>>
> Hi Doug, yes thank you.
>
> I do not have access to the main Nutch site on apache.org
>
> The link on our home page currently points to:
> http://lucene.apache.org/nutch/faq.html
>
> It should point to: http://wiki.apache.org/nutch/FAQ
>
> Regards,
>
> Gal
>
Re: Doug - FAQ - Re: Maintaining only one FAQ
Posted by Gal Nitzan <gn...@usa.net>.
Jon Shoberg wrote:
> Doug Cutting wrote:
>> +1
>>
>> Gal Nitzan wrote:
>>
>>> Hi,
>>>
>>> I have enhanced the FAQ
>>> http://wiki.apache.org/nutch/FAQ?action=show to contain the
>>> questions from the FAQ on Nutch's home page
>>> http://lucene.apache.org/nutch/faq.html
>>>
>>> I propose to replace the current home page FAQ with the one in the
>>> wiki. I believe there should be only one FAQ and it is easier to
>>> maintain.
>>>
>>> Regards,
>>>
>>> Gal
>
> Doug,
>
> I would be glad to assist with the FAQ. Is there a way to start?
>
> BTW .. We're using Nutch for search on our site :)
>
> http://fisher.osu.edu
>
> Best,
>
> Jon Shoberg
> Systems Developer
> Fisher College of Business
>
>
>
> .
>
Hi Doug, yes thank you.
I do not have access to the main Nutch site on apache.org
The link on our home page currently points to:
http://lucene.apache.org/nutch/faq.html
It should point to: http://wiki.apache.org/nutch/FAQ
Regards,
Gal
Doug - FAQ - Re: Maintaining only one FAQ
Posted by Jon Shoberg <jo...@shoberg.net>.
Doug Cutting wrote:
> +1
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
>> to contain the questions from the FAQ on Nutch's home page
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the
>> wiki. I believe there should be only one FAQ and it is easier to
>> maintain.
>>
>> Regards,
>>
>> Gal
Doug,
I would be glad to assist with the FAQ. Is there a way to start?
BTW .. We're using Nutch for search on our site :)
http://fisher.osu.edu
Best,
Jon Shoberg
Systems Developer
Fisher College of Business
Re: Maintaining only one FAQ
Posted by Doug Cutting <cu...@nutch.org>.
+1
Gal Nitzan wrote:
> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show to
> contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the wiki.
> I believe there should be only one FAQ and it is easier to maintain.
>
> Regards,
>
> Gal
Re: [Nutch-general] Maintaining only one FAQ
Posted by og...@yahoo.com.
+1
And +1 for making it look like the existing Lucene FAQ.
Otis
--- Gal Nitzan <gn...@usa.net> wrote:
> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show
> to contain the questions from the FAQ on Nutch's home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the
> wiki. I believe there should be only one FAQ and it is easier to
> maintain.
>
> Regards,
>
> Gal
>
>
Maintaining only one FAQ
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show to
contain the questions from the FAQ on Nutch's home page
http://lucene.apache.org/nutch/faq.html
I propose to replace the current home page FAQ with the one in the wiki.
I believe there should be only one FAQ and it is easier to maintain.
Regards,
Gal
HTTP ERROR: 500
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I connected to Jetty on port 7845.
When I click jobdetails.jsp, I get:
HTTP ERROR: 500
Internal Server Error
RequestURI=/jobdetails.jsp
Powered by Jetty: <http://jetty.mortbay.org>
The job tracker works perfectly.
Regards,
Gal
Re: NDFS java.io.IOException
Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved. The
> crawling code in that branch has also been re-written to be
> MapReduce-based, and will automatically manage multi-machine fetching,
> db updates, indexing, etc.
>
> There's not yet much documentation for this version however. Probably
> the best documentation is in this pdf, and it is spartan:
>
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
>
>
> Here's a quick cheat sheet:
>
> svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
> cd mapred
> ant
>
> emacs conf/nutch-site.xml
> # define fs.default.name to be masterHost:XXXX
> # define mapred.job.tracker to be masterHost:YYYY
>
> emacs conf/mapred-default.xml
> # define mapred.map.tasks to be a multiple of # of slave hosts
> # define mapred.reduce.tasks to be # of slave hosts
>
> # make a file with slave host names
> echo slave1 >> ~/.slaves
> echo slave2 >> ~/.slaves
> echo slave3 >> ~/.slaves
>
> # start all ndfs & mapred daemons
> bin/start-all.sh
>
> # make a directory with seed list file
> mkdir seeds
> echo http://lucene.apache.org/nutch/ > seeds/urls
>
> # put seed directory in ndfs
> bin/nutch ndfs -put seeds seeds
>
> # crawl a bit
> bin/nutch crawl seeds -depth 3
>
> # monitor things from administrative interface
> firefox masterHost:7845
>
> If you try this, please tell us how it goes.
>
> Doug
>
> .
>
Hi,
This cheat sheet worked perfectly, first time!!!
And all I can say is wow. Looks great.
Gal.
Re: NDFS java.io.IOException
Posted by Rod Taylor <rb...@sitesell.com>.
> > Secondly, will it still be possible to get the output dumped (i.e.
> > segread -dump) to a flat file in large chunks?
>
> In principle, yes, but I have not tested the segread code in the mapred
> branch, and it may need to be updated, as the structure of segments has
> changed a bit.
I'm not a Java programmer nor do I really understand what is going on,
but I took a crack at reimplementing the most basic version of the
segread code (full output with -dump to stdout).
It appears to function correctly with a single Nutch backend. I am sure
it is not correct to send data to STDOUT from the reduce() function,
but I'm not sure what other location is more appropriate.
I am hoping that this will encourage someone to either finish it off or
tell me about the logic issues.
The attached SegmentReader.java goes into org.apache.nutch.crawl and you
may need to fiddle with the bin/nutch shell script to use it.
--
Rod Taylor <rb...@sitesell.com>
Re: NDFS java.io.IOException
Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> I haven't looked at it and wasn't concerned until I saw the
> "automatically" but will we still be able to crawl and not index?
Yes, but the sequence of commands has changed slightly. Look at Crawl.java.
> Secondly, will it still be possible to get the output dumped (i.e.
> segread -dump) to a flat file in large chunks?
In principle, yes, but I have not tested the segread code in the mapred
branch, and it may need to be updated, as the structure of segments has
changed a bit.
Doug
Re: NDFS java.io.IOException
Posted by Rod Taylor <rb...@sitesell.com>.
On Tue, 2005-09-20 at 19:07 -0700, Doug Cutting wrote:
> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved. The crawling
> code in that branch has also been re-written to be MapReduce-based, and
> will automatically manage multi-machine fetching, db updates, indexing, etc.
I haven't looked at it, and wasn't concerned until I saw the word
"automatically", but will we still be able to crawl and not index?
Secondly, will it still be possible to get the output dumped (i.e.
segread -dump) to a flat file in large chunks?
We use Nutch as a crawler only, then after taking a dump of the data we
remove the segment from the filesystem. This means we only have a couple
hundred GB of data around at any given time.
We do our crawling, db updates, etc. in one environment then
post-process the HTML retrieved in large chunks (segments of about 200k
pages) within another environment.
--
Rod Taylor <rb...@sitesell.com>
Re: NDFS java.io.IOException
Posted by Doug Cutting <cu...@nutch.org>.
What version of Nutch are you using?
The version of NDFS in the mapred branch is much improved. The crawling
code in that branch has also been re-written to be MapReduce-based, and
will automatically manage multi-machine fetching, db updates, indexing, etc.
There's not yet much documentation for this version, however. Probably
the best documentation is in this PDF, and it is spartan:
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
Here's a quick cheat sheet:
svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
cd mapred
ant

emacs conf/nutch-site.xml
# define fs.default.name to be masterHost:XXXX
# define mapred.job.tracker to be masterHost:YYYY

emacs conf/mapred-default.xml
# define mapred.map.tasks to be a multiple of # of slave hosts
# define mapred.reduce.tasks to be # of slave hosts

# make a file with slave host names
echo slave1 >> ~/.slaves
echo slave2 >> ~/.slaves
echo slave3 >> ~/.slaves

# start all ndfs & mapred daemons
bin/start-all.sh

# make a directory with seed list file
mkdir seeds
echo http://lucene.apache.org/nutch/ > seeds/urls

# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds

# crawl a bit
bin/nutch crawl seeds -depth 3

# monitor things from administrative interface
firefox masterHost:7845
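
The property edits in the cheat sheet can also be generated rather than typed by hand. Here is a minimal sketch, assuming the `<property>`/`<name>`/`<value>` layout used by the existing files under conf/. The function name `nutch_property` is hypothetical, and the caller supplies the values because the cheat sheet leaves the ports unspecified:

```shell
# Hypothetical helper: prints one <property> element for a Nutch config file.
# Assumes the <property>/<name>/<value> layout used in conf/nutch-default.xml;
# check that file in your checkout before pasting the output anywhere.
# Values are written as-is, so XML special characters are not escaped.
nutch_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}
```

For instance, `nutch_property mapred.reduce.tasks 3` prints an element for the three slave hosts listed above. The output still has to be pasted inside the root element of the relevant file under conf/.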
If you try this, please tell us how it goes.
Doug