Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/08/04 19:17:49 UTC

near-term plan

Here's a near-term plan for Nutch.

1. Release Nutch 0.7, based on current trunk.  We should do this ASAP. 
Are there bugs in trunk that we need to fix before this can be done? 
The trunk will be copied to a 0.7 release branch.

2. Merge the mapred branch to trunk.

3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a 
separate project for distributed computing tools.  If the Lucene PMC 
approves this, it would be a new Lucene sub-project, a Nutch sibling.

Does this sound reasonable to folks?

Doug


Re: Documentation

Posted by Nishant Chandra <ni...@gmail.com>.
Hi,
I am looking for more technical documentation (a developers' manual). The
link doesn't have that. Can someone help me with this?

Nishant

On 8/4/05, Stefan Groschupf <sg...@media-style.com> wrote:
> try:
> http://wiki.media-style.com/display/nutchDocu/Home
> 
> Stefan
> 
> On 04.08.2005 at 19:54, Nishant Chandra wrote:
> 
> > Hi,
> > I am new to Nutch. Are there any articles/tutorials explaining the
> > internal workings of the crawler (crawl strategy), etc.?
> >
> > Nishant
> >
> >
> 
> ---------------------------------------------------------------
> company:        http://www.media-style.com
> forum:        http://www.text-mining.org
> blog:            http://www.find23.net
> 
> 
> 
> 


-- 
Website: www.cse.iitb.ac.in/~nishantc

Re: Documentation

Posted by Stefan Groschupf <sg...@media-style.com>.
try:
http://wiki.media-style.com/display/nutchDocu/Home

Stefan

On 04.08.2005 at 19:54, Nishant Chandra wrote:

> Hi,
> I am new to Nutch. Are there any articles/tutorials explaining the
> internal workings of the crawler (crawl strategy), etc.?
>
> Nishant
>
>

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Documentation

Posted by Nishant Chandra <ni...@gmail.com>.
Hi,
I am new to Nutch. Are there any articles/tutorials explaining the
internal workings of the crawler (crawl strategy), etc.?

Nishant

Detecting unmodified content patches (Re: near-term plan)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
> 
>> So, I would propose a deadline of Aug 8 for the last commits, and then 
>> perhaps Aug 15 for the release?
> 
> 
> Sounds good to me.  Thanks for helping with this!

Unfortunately, the patches related to detecting unmodified content will 
have to wait until after the release.

Here's the problem: it's quite easy to add this checking and recording 
capability to all fetcher plugins, fetchlist generation and db update 
tools, and I've done this in my local patches. However, after a while I 
discovered a serious problem in the way Nutch currently manages "phasing 
out" of old segment data. If we assume that we always refresh after some 
fixed interval (30 days, or whatever), then we can safely delete 
segments older than 30 days. If the interval varies, though, we could 
potentially be stuck with some segments holding very old (but still 
valid) data. This is very inefficient, because after a while a given 
segment might contain only a couple of such pages, and the rest of them 
would have to be removed again and again by deduplication because newer 
versions would exist in newer segments.

Moreover (and this is the worst problem), if such segments are lost, the 
information in the webdb must be updated in a way that forces refetching, 
even though "If-Modified-Since" or the MD5 indicates that the page is 
still unchanged since the last fetch. Currently the only way to do this 
is to "add days" - but if we use a variable refetch interval, that 
doesn't make much sense. I think we need a better way to track which 
pages are "missing" from the segments and have to be re-fetched, or a 
better DB update mechanism for when we lose some segments.

Perhaps we should extend the Page to record which segment holds the 
latest version of the page? But segments don't have unique IDs now (a 
directory name is too fragile and too easily changed) ...
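
A rough sketch of that idea in Java (hypothetical names; Nutch's actual 
Page and segment code differ):

    import java.security.MessageDigest;
    import java.util.Set;

    // Hypothetical sketch: give each segment a stable numeric ID and
    // record, per page, which segment holds the latest fetched copy.
    class PageRecord {
      String url;
      byte[] contentMd5;    // MD5 of the content as of the last fetch
      long lastSegmentId;   // stable ID, not the fragile directory name
      long lastFetchTime;

      // The latest copy is gone (segment lost/deleted), so the page must
      // be refetched even if If-Modified-Since/MD5 says it is unchanged.
      boolean latestCopyLost(Set<Long> liveSegmentIds) {
        return !liveSegmentIds.contains(lastSegmentId);
      }

      // After a conditional fetch: is the content actually unchanged?
      boolean unchangedSince(byte[] refetchedMd5) {
        return MessageDigest.isEqual(contentMd5, refetchedMd5);
      }
    }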

Related question: in the FetchListEntry we have a "fetch" flag. I think 
that after minor modifications to the FetchListTool (to generate only 
entries that we are supposed to fetch) we could get rid of this flag, or 
change its semantics to mean "unconditionally fetch, even if unmodified".
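
To illustrate the proposed semantics (illustrative only, not the real 
FetchListEntry):

    // Illustrative only: if the FetchListTool emits only entries that
    // should actually be fetched, the boolean "fetch" flag can either
    // go away or become a "fetch even if unmodified" marker.
    enum FetchMode {
      CONDITIONAL,         // honor If-Modified-Since / MD5, may skip refetch
      FORCE_UNCONDITIONAL  // refetch even if the server says unmodified
    }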

Any comments?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: near-term plan

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> So, I 
> would propose a deadline of Aug 8 for the last commits, and then perhaps 
> Aug 15 for the release?

Sounds good to me.  Thanks for helping with this!

Doug

Re: near-term plan

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
I think it is a good idea to release ASAP. I wanted to contribute my code
for fault-tolerant searching - it is taking more time than I expected 
because, as some of you know, in the meantime I became a father. But I 
hope I will be able to send something for comments early next week. I 
will look at Jira to check whether some more bugs can be fixed before the 
deadline proposed by Andrzej.
Regards
Piotr


Andrzej Bialecki wrote:
> Doug Cutting wrote:
> 
>> Here's a near-term plan for Nutch.
>>
>> 1. Release Nutch 0.7, based on current trunk.  We should do this ASAP. 
>> Are there bugs in trunk that we need to fix before this can be done? 
>> The trunk will be copied to a 0.7 release branch.
>>
> 
> I'll be back from vacation in 3-4 days, I hope I can do some work in the 
> meantime; I'd like to close some bugs marked with Major (e.g. the 
> multi-line protocol properties), and perhaps integrate the RSS parser 
> before the release. Other than that I think we should do it ASAP. So, I 
> would propose a deadline of Aug 8 for the last commits, and then perhaps 
> Aug 15 for the release?
> 
>> 2. Merge the mapred branch to trunk.
>>
>> 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a 
>> separate project for distributed computing tools.  If the Lucene PMC 
>> approves this, it would be a new Lucene sub-project, a Nutch sibling.
> 
> 
> I concur. They are very useful at times in unrelated projects.
> 
> 


Re: near-term plan

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Here's a near-term plan for Nutch.
> 
> 1. Release Nutch 0.7, based on current trunk.  We should do this ASAP. 
> Are there bugs in trunk that we need to fix before this can be done? The 
> trunk will be copied to a 0.7 release branch.
> 

I'll be back from vacation in 3-4 days, I hope I can do some work in the 
meantime; I'd like to close some bugs marked with Major (e.g. the 
multi-line protocol properties), and perhaps integrate the RSS parser 
before the release. Other than that I think we should do it ASAP. So, I 
would propose a deadline of Aug 8 for the last commits, and then perhaps 
Aug 15 for the release?

> 2. Merge the mapred branch to trunk.
> 
> 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a 
> separate project for distributed computing tools.  If the Lucene PMC 
> approves this, it would be a new Lucene sub-project, a Nutch sibling.

I concur. They are very useful at times in unrelated projects.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fw: Re: near-term plan

Posted by Piotr Kosiorowski <pk...@gmail.com>.
I think it was already answered by Doug earlier in this thread.

"... Yes.  It is alpha-quality, not yet release-worthy, but it works.  If
you're an experienced Java developer, I'd encourage you to give it a
try.  If you're a user who doesn't want to look beyond the config files,
then I'd wait a bit."

P.

On 8/5/05, Jay Pound <we...@poundwebhosting.com> wrote:
> Is the mapreduce working yet?
> I would also like to test it.
> -J
> ----- Original Message -----
> From: "Piotr Kosiorowski" <pk...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Friday, August 05, 2005 8:06 AM
> Subject: Re: Fw: Re: near-term plan
> 
> 
> > I am not sure what exactly you did in this test, but I understand you
> > were using a jar file prepared by me (it was Nutch from trunk + NDFS
> > patches). As these patches were applied by Andrzej some time ago, we
> > can assume you were using NDFS code from trunk.
> > Because a lot of work went into the mapreduce branch, it would be good
> > to test it with the mapreduce branch code.
> > Regards
> > Piotr
> >
> > On 8/5/05, webmaster <we...@www.poundwebhosting.com> wrote:
> > >
> > > ---------- Forwarded Message -----------
> > > From: "webmaster" <sa...@www.poundwebhosting.com>
> > > To: nutch-dev@lucene.apache.org
> > > Sent: Thu, 4 Aug 2005 19:42:53 -0500
> > > Subject: Re: near-term plan
> > >
> > > I was using a nightly build that Piotr had given me, the
> > > nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of
> > > that nature). I tested it on the Windows platform. I had 5 machines
> > > running it: 2 quad P3 Xeons, both at 100 Mbit, 1 Pentium 4 3 GHz with
> > > Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+; all
> > > have 1 GB or more of RAM. Now I have my big server, and if you have
> > > worked on NDFS since the beginning of July I'll test it again; my big
> > > server's HD array is very fast (200+ MB/s), so it will be able to
> > > saturate gigabit better.
> > >
> > > Anyway, the P4 and the 2 AMD machines are hooked into the switch at
> > > gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit,
> > > which has a gigabit uplink to my gigabit switch, so both Xeons would
> > > constantly be saturated at 11 MB/s, while the P4 was able to reach
> > > higher speeds of 50-60 MB/s with its internal RAID 0 array (dual
> > > 120 GB drives). My main PC (Athlon 64 3500+) was the namenode, a
> > > datanode, and also the NDFS client.
> > >
> > > I could not get Nutch to work properly with NDFS. It was set up
> > > correctly, and it "kinda" worked, but it would crash the namenode
> > > when I was trying to fetch segments in the NDFS filesystem, or index
> > > them, or do much of anything. So I copied all my segment directories,
> > > indexes, content, whatever - it was 1.8 GB - plus some DVD images
> > > onto NDFS. My primary machine and Nutch run off 10,000 RPM disks in
> > > RAID 0 (2x36 GB Raptors); they can output about 120 MB/s sustained.
> > >
> > > Here is what I found out (in Windows): if I don't start a datanode on
> > > the namenode machine, with the conf pointing to 127.0.0.1 instead of
> > > its outside IP, the namenode will not copy data to the other
> > > machines. If instead I'm running a datanode on the namenode machine,
> > > data will replicate from that datanode to the other 3 datanodes. I
> > > tried a hundred ways to make this work with an independent namenode,
> > > without luck.
> > >
> > > The way I saw data go across my network: I would put data into NDFS,
> > > the namenode would request a datanode, find the internal datanode,
> > > and copy data to it. Only then, while the datanode was still copying
> > > data from my other HDs into chunks on the RAID array, would it
> > > replicate to the P4 via gigabit at 50-60 MB/s; then it would
> > > replicate from the P4 to the Xeons, kind of alternating between them,
> > > as I only had replication at the default of 2 and about 100 GB to
> > > copy in. So the copy would finish onto the internal RAID array fairly
> > > quickly, then it finished replication to the P4, and the Xeons got a
> > > little bit of data, but not nearly as much as the P4. My guess is it
> > > only needs 2 copies, and the first copy was the datanode on the
> > > internal machine, the second was the P4 datanode. The Xeons only had
> > > a smaller connection, so they didn't receive chunks as fast as the P4
> > > could, and the P4 had enough space for all the data, so it worked
> > > out. I should have set replication to 4.
> > >
> > > The AMD Athlon XP 1900+ was running SUSE Linux 9.3, and it would
> > > crash the namenode on Windows if I connected it as a datanode, so
> > > that one didn't get tested. But I was able to put out 50-60 MB/s to 1
> > > machine; it would not replicate data to multiple machines at the same
> > > time, it seemed. I would have thought it would output to the Xeons at
> > > the same time as the P4 - give the Xeons 20% of the data and the P4
> > > 80%, or something of that nature - but it could be that they just
> > > aren't fast enough to request data before the P4 receives its 32 MB
> > > chunks every half second?
> > >
> > > The good news: CPU usage was only at 50% on my AMD 3500+, and that
> > > was while it was copying data to the internal datanode from the NDFS
> > > client off another internal HD, running the namenode, and running the
> > > datanode internally. Does it now work with a separate namenode? I'm
> > > getting ready to run Nutch on Linux full time, if I can ever get the
> > > damn driver for my HighPoint 2220 RAID card to work with SUSE - any
> > > SUSE; the drivers don't work with dual-core CPUs or something??? They
> > > are working on it; for now I'm stuck with Fedora 4 until they fix it.
> > > So it's not ready for testing yet. I'll let you know when I can test
> > > it in a full Linux environment.
> > >
> > > Wow, that was a long one!!!
> > > -Jay
> > > ------- End of Forwarded Message -------
> > >
> > >
> > > --
> > > Pound Web Hosting www.poundwebhosting.com
> > > (607)-435-3048
> > >
> >
> >
> 
> 
>

Re: Fw: Re: near-term plan

Posted by Jay Pound <we...@poundwebhosting.com>.
Is the mapreduce working yet?
I would also like to test it.
-J
----- Original Message ----- 
From: "Piotr Kosiorowski" <pk...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Friday, August 05, 2005 8:06 AM
Subject: Re: Fw: Re: near-term plan


> I am not sure what exactly you did in this test, but I understand you
> were using a jar file prepared by me (it was Nutch from trunk + NDFS
> patches). As these patches were applied by Andrzej some time ago, we
> can assume you were using NDFS code from trunk.
> Because a lot of work went into the mapreduce branch, it would be good
> to test it with the mapreduce branch code.
> Regards
> Piotr
>
> On 8/5/05, webmaster <we...@www.poundwebhosting.com> wrote:
> >
> > ---------- Forwarded Message -----------
> > From: "webmaster" <sa...@www.poundwebhosting.com>
> > To: nutch-dev@lucene.apache.org
> > Sent: Thu, 4 Aug 2005 19:42:53 -0500
> > Subject: Re: near-term plan
> >
> > I was using a nightly build that Piotr had given me, the
> > nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of
> > that nature). I tested it on the Windows platform. I had 5 machines
> > running it: 2 quad P3 Xeons, both at 100 Mbit, 1 Pentium 4 3 GHz with
> > Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+; all
> > have 1 GB or more of RAM. Now I have my big server, and if you have
> > worked on NDFS since the beginning of July I'll test it again; my big
> > server's HD array is very fast (200+ MB/s), so it will be able to
> > saturate gigabit better.
> >
> > Anyway, the P4 and the 2 AMD machines are hooked into the switch at
> > gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit,
> > which has a gigabit uplink to my gigabit switch, so both Xeons would
> > constantly be saturated at 11 MB/s, while the P4 was able to reach
> > higher speeds of 50-60 MB/s with its internal RAID 0 array (dual
> > 120 GB drives). My main PC (Athlon 64 3500+) was the namenode, a
> > datanode, and also the NDFS client.
> >
> > I could not get Nutch to work properly with NDFS. It was set up
> > correctly, and it "kinda" worked, but it would crash the namenode
> > when I was trying to fetch segments in the NDFS filesystem, or index
> > them, or do much of anything. So I copied all my segment directories,
> > indexes, content, whatever - it was 1.8 GB - plus some DVD images
> > onto NDFS. My primary machine and Nutch run off 10,000 RPM disks in
> > RAID 0 (2x36 GB Raptors); they can output about 120 MB/s sustained.
> >
> > Here is what I found out (in Windows): if I don't start a datanode on
> > the namenode machine, with the conf pointing to 127.0.0.1 instead of
> > its outside IP, the namenode will not copy data to the other
> > machines. If instead I'm running a datanode on the namenode machine,
> > data will replicate from that datanode to the other 3 datanodes. I
> > tried a hundred ways to make this work with an independent namenode,
> > without luck.
> >
> > The way I saw data go across my network: I would put data into NDFS,
> > the namenode would request a datanode, find the internal datanode,
> > and copy data to it. Only then, while the datanode was still copying
> > data from my other HDs into chunks on the RAID array, would it
> > replicate to the P4 via gigabit at 50-60 MB/s; then it would
> > replicate from the P4 to the Xeons, kind of alternating between them,
> > as I only had replication at the default of 2 and about 100 GB to
> > copy in. So the copy would finish onto the internal RAID array fairly
> > quickly, then it finished replication to the P4, and the Xeons got a
> > little bit of data, but not nearly as much as the P4. My guess is it
> > only needs 2 copies, and the first copy was the datanode on the
> > internal machine, the second was the P4 datanode. The Xeons only had
> > a smaller connection, so they didn't receive chunks as fast as the P4
> > could, and the P4 had enough space for all the data, so it worked
> > out. I should have set replication to 4.
> >
> > The AMD Athlon XP 1900+ was running SUSE Linux 9.3, and it would
> > crash the namenode on Windows if I connected it as a datanode, so
> > that one didn't get tested. But I was able to put out 50-60 MB/s to 1
> > machine; it would not replicate data to multiple machines at the same
> > time, it seemed. I would have thought it would output to the Xeons at
> > the same time as the P4 - give the Xeons 20% of the data and the P4
> > 80%, or something of that nature - but it could be that they just
> > aren't fast enough to request data before the P4 receives its 32 MB
> > chunks every half second?
> >
> > The good news: CPU usage was only at 50% on my AMD 3500+, and that
> > was while it was copying data to the internal datanode from the NDFS
> > client off another internal HD, running the namenode, and running the
> > datanode internally. Does it now work with a separate namenode? I'm
> > getting ready to run Nutch on Linux full time, if I can ever get the
> > damn driver for my HighPoint 2220 RAID card to work with SUSE - any
> > SUSE; the drivers don't work with dual-core CPUs or something??? They
> > are working on it; for now I'm stuck with Fedora 4 until they fix it.
> > So it's not ready for testing yet. I'll let you know when I can test
> > it in a full Linux environment.
> >
> > Wow, that was a long one!!!
> > -Jay
> > ------- End of Forwarded Message -------
> >
> >
> > --
> > Pound Web Hosting www.poundwebhosting.com
> > (607)-435-3048
> >
>
>



Re: Fw: Re: near-term plan

Posted by Piotr Kosiorowski <pk...@gmail.com>.
I am not sure what exactly you did in this test, but I understand you
were using a jar file prepared by me (it was Nutch from trunk + NDFS
patches). As these patches were applied by Andrzej some time ago, we
can assume you were using NDFS code from trunk.
Because a lot of work went into the mapreduce branch, it would be good
to test it with the mapreduce branch code.
Regards
Piotr

On 8/5/05, webmaster <we...@www.poundwebhosting.com> wrote:
> 
> ---------- Forwarded Message -----------
> From: "webmaster" <sa...@www.poundwebhosting.com>
> To: nutch-dev@lucene.apache.org
> Sent: Thu, 4 Aug 2005 19:42:53 -0500
> Subject: Re: near-term plan
> 
> I was using a nightly build that Piotr had given me, the
> nutch-nightly.jar (actually it was nutch-dev0.7.jar or something of
> that nature). I tested it on the Windows platform. I had 5 machines
> running it: 2 quad P3 Xeons, both at 100 Mbit, 1 Pentium 4 3 GHz with
> Hyper-Threading, 1 AMD Athlon XP 2600+, and 1 Athlon 64 3500+; all
> have 1 GB or more of RAM. Now I have my big server, and if you have
> worked on NDFS since the beginning of July I'll test it again; my big
> server's HD array is very fast (200+ MB/s), so it will be able to
> saturate gigabit better.
>
> Anyway, the P4 and the 2 AMD machines are hooked into the switch at
> gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit,
> which has a gigabit uplink to my gigabit switch, so both Xeons would
> constantly be saturated at 11 MB/s, while the P4 was able to reach
> higher speeds of 50-60 MB/s with its internal RAID 0 array (dual
> 120 GB drives). My main PC (Athlon 64 3500+) was the namenode, a
> datanode, and also the NDFS client.
>
> I could not get Nutch to work properly with NDFS. It was set up
> correctly, and it "kinda" worked, but it would crash the namenode
> when I was trying to fetch segments in the NDFS filesystem, or index
> them, or do much of anything. So I copied all my segment directories,
> indexes, content, whatever - it was 1.8 GB - plus some DVD images
> onto NDFS. My primary machine and Nutch run off 10,000 RPM disks in
> RAID 0 (2x36 GB Raptors); they can output about 120 MB/s sustained.
>
> Here is what I found out (in Windows): if I don't start a datanode on
> the namenode machine, with the conf pointing to 127.0.0.1 instead of
> its outside IP, the namenode will not copy data to the other
> machines. If instead I'm running a datanode on the namenode machine,
> data will replicate from that datanode to the other 3 datanodes. I
> tried a hundred ways to make this work with an independent namenode,
> without luck.
>
> The way I saw data go across my network: I would put data into NDFS,
> the namenode would request a datanode, find the internal datanode,
> and copy data to it. Only then, while the datanode was still copying
> data from my other HDs into chunks on the RAID array, would it
> replicate to the P4 via gigabit at 50-60 MB/s; then it would
> replicate from the P4 to the Xeons, kind of alternating between them,
> as I only had replication at the default of 2 and about 100 GB to
> copy in. So the copy would finish onto the internal RAID array fairly
> quickly, then it finished replication to the P4, and the Xeons got a
> little bit of data, but not nearly as much as the P4. My guess is it
> only needs 2 copies, and the first copy was the datanode on the
> internal machine, the second was the P4 datanode. The Xeons only had
> a smaller connection, so they didn't receive chunks as fast as the P4
> could, and the P4 had enough space for all the data, so it worked
> out. I should have set replication to 4.
>
> The AMD Athlon XP 1900+ was running SUSE Linux 9.3, and it would
> crash the namenode on Windows if I connected it as a datanode, so
> that one didn't get tested. But I was able to put out 50-60 MB/s to 1
> machine; it would not replicate data to multiple machines at the same
> time, it seemed. I would have thought it would output to the Xeons at
> the same time as the P4 - give the Xeons 20% of the data and the P4
> 80%, or something of that nature - but it could be that they just
> aren't fast enough to request data before the P4 receives its 32 MB
> chunks every half second?
>
> The good news: CPU usage was only at 50% on my AMD 3500+, and that
> was while it was copying data to the internal datanode from the NDFS
> client off another internal HD, running the namenode, and running the
> datanode internally. Does it now work with a separate namenode? I'm
> getting ready to run Nutch on Linux full time, if I can ever get the
> damn driver for my HighPoint 2220 RAID card to work with SUSE - any
> SUSE; the drivers don't work with dual-core CPUs or something??? They
> are working on it; for now I'm stuck with Fedora 4 until they fix it.
> So it's not ready for testing yet. I'll let you know when I can test
> it in a full Linux environment.
>
> Wow, that was a long one!!!
> -Jay
> ------- End of Forwarded Message -------
> 
> 
> --
> Pound Web Hosting www.poundwebhosting.com
> (607)-435-3048
>

Fw: Re: near-term plan

Posted by webmaster <we...@www.poundwebhosting.com>.
---------- Forwarded Message -----------
From: "webmaster" <sa...@www.poundwebhosting.com>
To: nutch-dev@lucene.apache.org
Sent: Thu, 4 Aug 2005 19:42:53 -0500
Subject: Re: near-term plan

I was using a nightly build that Piotr had given me, the nutch-nightly.jar 
(actually it was nutch-dev0.7.jar or something of that nature). I tested it 
on the Windows platform. I had 5 machines running it: 2 quad P3 Xeons, both 
at 100 Mbit, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+, 
and 1 Athlon 64 3500+; all have 1 GB or more of RAM. Now I have my big 
server, and if you have worked on NDFS since the beginning of July I'll 
test it again; my big server's HD array is very fast (200+ MB/s), so it 
will be able to saturate gigabit better.

Anyway, the P4 and the 2 AMD machines are hooked into the switch at 
gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit, which 
has a gigabit uplink to my gigabit switch, so both Xeons would constantly 
be saturated at 11 MB/s, while the P4 was able to reach higher speeds of 
50-60 MB/s with its internal RAID 0 array (dual 120 GB drives). My main PC 
(Athlon 64 3500+) was the namenode, a datanode, and also the NDFS client.

I could not get Nutch to work properly with NDFS. It was set up correctly, 
and it "kinda" worked, but it would crash the namenode when I was trying to 
fetch segments in the NDFS filesystem, or index them, or do much of 
anything. So I copied all my segment directories, indexes, content, 
whatever - it was 1.8 GB - plus some DVD images onto NDFS. My primary 
machine and Nutch run off 10,000 RPM disks in RAID 0 (2x36 GB Raptors); 
they can output about 120 MB/s sustained.

Here is what I found out (in Windows): if I don't start a datanode on the 
namenode machine, with the conf pointing to 127.0.0.1 instead of its 
outside IP, the namenode will not copy data to the other machines. If 
instead I'm running a datanode on the namenode machine, data will 
replicate from that datanode to the other 3 datanodes. I tried a hundred 
ways to make this work with an independent namenode, without luck.

The way I saw data go across my network: I would put data into NDFS, the 
namenode would request a datanode, find the internal datanode, and copy 
data to it. Only then, while the datanode was still copying data from my 
other HDs into chunks on the RAID array, would it replicate to the P4 via 
gigabit at 50-60 MB/s; then it would replicate from the P4 to the Xeons, 
kind of alternating between them, as I only had replication at the default 
of 2 and about 100 GB to copy in. So the copy would finish onto the 
internal RAID array fairly quickly, then it finished replication to the 
P4, and the Xeons got a little bit of data, but not nearly as much as the 
P4. My guess is it only needs 2 copies, and the first copy was the 
datanode on the internal machine, the second was the P4 datanode. The 
Xeons only had a smaller connection, so they didn't receive chunks as fast 
as the P4 could, and the P4 had enough space for all the data, so it 
worked out. I should have set replication to 4.

The AMD Athlon XP 1900+ was running SUSE Linux 9.3, and it would crash the 
namenode on Windows if I connected it as a datanode, so that one didn't 
get tested. But I was able to put out 50-60 MB/s to 1 machine; it would 
not replicate data to multiple machines at the same time, it seemed. I 
would have thought it would output to the Xeons at the same time as the P4 
- give the Xeons 20% of the data and the P4 80%, or something of that 
nature - but it could be that they just aren't fast enough to request data 
before the P4 receives its 32 MB chunks every half second?

The good news: CPU usage was only at 50% on my AMD 3500+, and that was 
while it was copying data to the internal datanode from the NDFS client 
off another internal HD, running the namenode, and running the datanode 
internally. Does it now work with a separate namenode? I'm getting ready 
to run Nutch on Linux full time, if I can ever get the damn driver for my 
HighPoint 2220 RAID card to work with SUSE - any SUSE; the drivers don't 
work with dual-core CPUs or something??? They are working on it; for now 
I'm stuck with Fedora 4 until they fix it. So it's not ready for testing 
yet. I'll let you know when I can test it in a full Linux environment.

Wow, that was a long one!!!
-Jay
------- End of Forwarded Message -------


--
Pound Web Hosting www.poundwebhosting.com
(607)-435-3048

Re: near-term plan

Posted by webmaster <sa...@www.poundwebhosting.com>.
I was using a nightly build that Piotr had given me, the nutch-nightly.jar 
(actually it was nutch-dev0.7.jar or something of that nature). I tested it 
on the Windows platform. I had 5 machines running it: 2 quad P3 Xeons, both 
at 100 Mbit, 1 Pentium 4 3 GHz with Hyper-Threading, 1 AMD Athlon XP 2600+, 
and 1 Athlon 64 3500+; all have 1 GB or more of RAM. Now I have my big 
server, and if you have worked on NDFS since the beginning of July I'll 
test it again; my big server's HD array is very fast (200+ MB/s), so it 
will be able to saturate gigabit better.

Anyway, the P4 and the 2 AMD machines are hooked into the switch at 
gigabit, and the 2 Xeons are hooked into my other switch at 100 Mbit, which 
has a gigabit uplink to my gigabit switch, so both Xeons would constantly 
be saturated at 11 MB/s, while the P4 was able to reach higher speeds of 
50-60 MB/s with its internal RAID 0 array (dual 120 GB drives). My main PC 
(Athlon 64 3500+) was the namenode, a datanode, and also the NDFS client.

I could not get Nutch to work properly with NDFS. It was set up correctly, 
and it "kinda" worked, but it would crash the namenode when I was trying to 
fetch segments in the NDFS filesystem, or index them, or do much of 
anything. So I copied all my segment directories, indexes, content, 
whatever - it was 1.8 GB - plus some DVD images onto NDFS. My primary 
machine and Nutch run off 10,000 RPM disks in RAID 0 (2x36 GB Raptors); 
they can output about 120 MB/s sustained.

Here is what I found out (in Windows): if I don't start a datanode on the 
namenode machine, with the conf pointing to 127.0.0.1 instead of its 
outside IP, the namenode will not copy data to the other machines. If 
instead I'm running a datanode on the namenode machine, data will 
replicate from that datanode to the other 3 datanodes. I tried a hundred 
ways to make this work with an independent namenode, without luck.

The way I saw data go across my network: I would put data into NDFS, the 
namenode would request a datanode, find the internal datanode, and copy 
data to it. Only then, while the datanode was still copying data from my 
other HDs into chunks on the RAID array, would it replicate to the P4 via 
gigabit at 50-60 MB/s; then it would replicate from the P4 to the Xeons, 
kind of alternating between them, as I only had replication at the default 
of 2 and about 100 GB to copy in. So the copy would finish onto the 
internal RAID array fairly quickly, then it finished replication to the 
P4, and the Xeons got a little bit of data, but not nearly as much as the 
P4. My guess is it only needs 2 copies, and the first copy was the 
datanode on the internal machine, the second was the P4 datanode. The 
Xeons only had a smaller connection, so they didn't receive chunks as fast 
as the P4 could, and the P4 had enough space for all the data, so it 
worked out. I should have set replication to 4.

The AMD Athlon XP 1900+ was running SUSE Linux 9.3, and it would crash the 
namenode on Windows if I connected it as a datanode, so that one didn't 
get tested. But I was able to put out 50-60 MB/s to 1 machine; it would 
not replicate data to multiple machines at the same time, it seemed. I 
would have thought it would output to the Xeons at the same time as the P4 
- give the Xeons 20% of the data and the P4 80%, or something of that 
nature - but it could be that they just aren't fast enough to request data 
before the P4 receives its 32 MB chunks every half second?

The good news: CPU usage was only at 50% on my AMD 3500+, and that was 
while it was copying data to the internal datanode from the NDFS client 
off another internal HD, running the namenode, and running the datanode 
internally. Does it now work with a separate namenode? I'm getting ready 
to run Nutch on Linux full time, if I can ever get the damn driver for my 
HighPoint 2220 RAID card to work with SUSE - any SUSE; the drivers don't 
work with dual-core CPUs or something??? They are working on it; for now 
I'm stuck with Fedora 4 until they fix it. So it's not ready for testing 
yet. I'll let you know when I can test it in a full Linux environment.

Wow, that was a long one!!!
-Jay
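
If the default replication factor was the limit here, raising it would be 
a one-line config change. A sketch, assuming the NDFS property was named 
"ndfs.replication" in that era's nutch-default.xml - verify against your 
copy, since early NDFS builds may have differed:

    <!-- nutch-site.xml: raise NDFS replication above the default.
         The property name is an assumption; check nutch-default.xml. -->
    <property>
      <name>ndfs.replication</name>
      <value>4</value>
    </property>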

Re: near-term plan

Posted by Doug Cutting <cu...@nutch.org>.
Jay Pound wrote:
> Doug, I also ran into this when I was testing NDFS: the system would have to
> wait for the namenode to tell the datanodes what data to receive and which
> data to replicate

When did you test this?  Which version of Nutch?  How many nodes?  My 
benchmark results are from just a few days ago.  There've been a lot of 
fixes in the past week, and NDFS now works much better.

> I'm currently setting up Lustre to see how it works; it operates at the
> kernel level. Do you think the namenode would perform better if it were
> not Java? I plan on running a system where the namenode (metadata)
> server will have to perform thousands of I/Os a second, concurrently
> updating indexes of multiple segments, updating the db on one machine,
> and fetching multiple segments on multiple machines, all accessing the
> same logical filesystem at the same time.

While running the benchmark, the namenode was typically using only 2% of 
its 1 GHz CPU.

> PS: Where can I find out about MapReduce? I read the presentations, but
> I don't get the core concept of it.

http://labs.google.com/papers/mapreduce.html
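
For the core concept in miniature, here is a toy word count in plain Java 
- not Nutch's actual MapReduce API; all names below are illustrative. 
map() turns each record into (key, value) pairs, the framework groups the 
pairs by key (the "shuffle"), and reduce() folds each group:

    import java.util.*;

    public class WordCountSketch {

        // map: one input line -> a list of (word, 1) pairs
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.split("\\s+"))
                if (!w.isEmpty())
                    out.add(new AbstractMap.SimpleEntry<>(w, 1));
            return out;
        }

        // reduce: one word and all its counts -> a single total
        static int reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return sum;
        }

        public static void main(String[] args) {
            String[] input = { "the quick brown fox", "the lazy dog" };

            // The "shuffle" phase: group map output by key, as the
            // framework would do across machines.
            Map<String, List<Integer>> groups = new TreeMap<>();
            for (String line : input)
                for (Map.Entry<String, Integer> pair : map(line))
                    groups.computeIfAbsent(pair.getKey(),
                                           k -> new ArrayList<>())
                          .add(pair.getValue());

            for (Map.Entry<String, List<Integer>> g : groups.entrySet())
                System.out.println(g.getKey() + "\t"
                                   + reduce(g.getKey(), g.getValue()));
        }
    }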

> PPS: VIA chips aren't very FPU-powerful; try an Opteron for your
> namenode. I bet you will see a huge improvement in speed, even over
> Xeons, P4s, etc. I was only able to test 5 machines, but I was able to
> saturate 50-60 MB/s to each (mainly replication throughput running level 1)

Via is not my first choice of CPU, it's simply what the Internet Archive 
has given me to use.  With hundreds of datanodes a Via-based namenode 
could become a bottleneck.  Right now it is not.

Doug

Re: near-term plan

Posted by Jay Pound <we...@poundwebhosting.com>.
Doug, I also ran into this when I was testing NDFS: the system would have to
wait for the namenode to tell the datanodes what data to receive and which
data to replicate. I'm currently setting up Lustre to see how it works; it
operates at the kernel level. Do you think the namenode would perform
better if it were not Java? I plan on running a system where the namenode
(metadata) server will have to perform thousands of I/Os a second,
concurrently updating indexes of multiple segments, updating the db on one
machine, and fetching multiple segments on multiple machines, all accessing
the same logical filesystem at the same time. The way the namenode
responded, it took a few seconds to replicate data to other datanodes, and
it took time to start the copying of data. If you are writing an index,
imagine having to wait 1-10 seconds per file to be written (if queued);
that will cause serious problems. Also, I was able to saturate gigabit with
NDFS (well, about 50-60 MB/s; it's hard to do better than that over
copper), it just took a few seconds to "ramp up" to speed, and that's
including file copying and replication.
-Jay
PS: Where can I find out about MapReduce? I read the presentations, but
I don't get the core concept of it.

PPS: VIA chips aren't very FPU-powerful; try an Opteron for your
namenode. I bet you will see a huge improvement in speed, even over
Xeons, P4s, etc. I was only able to test 5 machines, but I was able to
saturate 50-60 MB/s to each (mainly replication throughput running level 1)

----- Original Message ----- 
From: "Doug Cutting" <cu...@nutch.org>
To: <nu...@lucene.apache.org>
Sent: Thursday, August 04, 2005 3:54 PM
Subject: Re: near-term plan


> Stefan Groschupf wrote:
> >> http://wiki.apache.org/nutch/Presentations
> >
> > Can you explain what this means on page 20:
> > - scheduling is the bottleneck, not disk, network or CPU?
>
> I mean that none of the CPUs, disks, or network are at 100% of capacity.
>   Disks are running around 50% busy, CPUs a bit higher, and the network
> switch has lots of bandwidth left.  (Although, if we used multiple racks
> connected with gigabit links, these inter-rack links would already be
> near capacity.)  So sometimes the CPU is busy generating random data and
> stuffing it in a buffer, and sometimes the disk is busy writing data,
> but we're not keeping both busy at the same time all the time.  Perhaps
> more threads/processes and/or bigger buffers would increase the
> utilization--I have not tried to tune things for this benchmark.  But I
> am not disappointed with this performance.  Rather, I think that it is
> fast enough that with real applications, with non-trivial map and
> reduce functions, NDFS will not be a bottleneck.
>
> Doug
>
>



Re: near-term plan

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
>> http://wiki.apache.org/nutch/Presentations
> 
> Can you explain what this means on page 20:
> - scheduling is the bottleneck, not disk, network or CPU?

I mean that none of the CPUs, disks, or network are at 100% of capacity. 
  Disks are running around 50% busy, CPUs a bit higher, and the network 
switch has lots of bandwidth left.  (Although, if we used multiple racks 
connected with gigabit links, these inter-rack links would already be 
near capacity.)  So sometimes the CPU is busy generating random data and 
stuffing it in a buffer, and sometimes the disk is busy writing data, 
but we're not keeping both busy at the same time all the time.  Perhaps 
more threads/processes and/or bigger buffers would increase the 
utilization--I have not tried to tune things for this benchmark.  But I 
am not disappointed with this performance.  Rather, I think that it is 
fast enough that with real applications, with non-trivial map and 
reduce functions, NDFS will not be a bottleneck.
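
A minimal sketch of the threads-and-buffers idea (hypothetical code, not 
from the benchmark): a bounded queue lets the CPU fill the next buffer 
while a writer thread keeps the disk busy, instead of alternating phases.

    import java.util.concurrent.*;

    public class PipelinedWriter {
        public static void main(String[] args) throws Exception {
            // 8 in-flight buffers; put() blocks only if the disk lags
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(8);
            byte[] poison = new byte[0];  // end-of-stream sentinel

            Thread writer = new Thread(() -> {
                try (java.io.OutputStream out =
                         new java.io.FileOutputStream("bench.out")) {
                    byte[] buf;
                    while ((buf = queue.take()) != poison)
                        out.write(buf);            // disk busy here...
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();

            java.util.Random rnd = new java.util.Random();
            for (int i = 0; i < 100; i++) {
                byte[] buf = new byte[1 << 20];    // ...while the CPU
                rnd.nextBytes(buf);                // fills the next buffer
                queue.put(buf);
            }
            queue.put(poison);
            writer.join();
        }
    }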

Doug

Re: near-term plan

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Doug,
> The slides from my talk yesterday at OSCON give some hints on how  
> to get started.  We need a MapReduce tutorial.
>
> http://wiki.apache.org/nutch/Presentations

Can you explain what this means on page 20:
- scheduling is the bottleneck, not disk, network or CPU?

Thanks.
Stefan 

Re: near-term plan

Posted by Doug Cutting <cu...@nutch.org>.
Fredrik Andersson wrote:
> Is the map-reduce currently functional?

Yes.  It is alpha-quality, not yet release-worthy, but it works.  If 
you're an experienced Java developer, I'd encourage you to give it a 
try.  If you're a user who doesn't want to look beyond the config files, 
then I'd wait a bit.

The slides from my talk yesterday at OSCON give some hints on how to get 
started.  We need a MapReduce tutorial.

http://wiki.apache.org/nutch/Presentations

Doug

Re: near-term plan

Posted by Fredrik Andersson <fi...@gmail.com>.
Is the map-reduce currently functional?

On 8/4/05, Andy Liu <an...@gmail.com> wrote:
> Sounds good.  I've used the io and fs classes for non-Nutch purposes,
> so this separation makes sense.
> 
> On 8/4/05, Doug Cutting <cu...@nutch.org> wrote:
> > Here's a near-term plan for Nutch.
> >
> > 1. Release Nutch 0.7, based on current trunk.  We should do this ASAP.
> > Are there bugs in trunk that we need to fix before this can be done?
> > The trunk will be copied to a 0.7 release branch.
> >
> > 2. Merge the mapred branch to trunk.
> >
> > 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a
> > separate project for distributed computing tools.  If the Lucene PMC
> > approves this, it would be a new Lucene sub-project, a Nutch sibling.
> >
> > Does this sound reasonable to folks?
> >
> > Doug
> >
> >
>

Re: near-term plan

Posted by Andy Liu <an...@gmail.com>.
Sounds good.  I've used the io and fs classes for non-Nutch purposes,
so this separation makes sense.

On 8/4/05, Doug Cutting <cu...@nutch.org> wrote:
> Here's a near-term plan for Nutch.
> 
> 1. Release Nutch 0.7, based on current trunk.  We should do this ASAP.
> Are there bugs in trunk that we need to fix before this can be done?
> The trunk will be copied to a 0.7 release branch.
> 
> 2. Merge the mapred branch to trunk.
> 
> 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a
> separate project for distributed computing tools.  If the Lucene PMC
> approves this, it would be a new Lucene sub-project, a Nutch sibling.
> 
> Does this sound reasonable to folks?
> 
> Doug
> 
>