Posted to dev@nutch.apache.org by Jay Pound <we...@poundwebhosting.com> on 2005/08/06 16:09:46 UTC

mapred

Could someone post some of the mapred commands?
Thanks,
-J


Re: NDFS benchmark results

Posted by Jay Pound <we...@poundwebhosting.com>.
> Multiple datanodes are automatically started if you have a
> comma-separated list of data directories in your config file.  One
> datanode thread is launched per data directory.  These are assumed to be
> on separate devices.

- Thanks, that will save me the time of managing N datanodes per box.

> This is a known problem.  Please file a bug report.  Better would be to
> submit a patch that fixes it, or hire someone to do so.

I'm looking for a programmer. Believe it or not, most of the programmers I
know can't even grasp the concept of a distributed filesystem. I live in
upstate NY, so there is very little here in terms of programmers; do you have
any areas to point me to to find someone? I also need an automation script
written: I have it on paper, what sequence of commands, how to thread them,
the GUI, etc., and I could probably write it myself with my limited
programming ability, but I'm only one person and I too don't have the time to
mess with it. If I do find someone to hire, I'll be sure to help in aiding
the development of Nutch; this is a fantastic program. For now I'll just let
you guys know if I'm able to find out any new performance tips!

-Jay



Re: NDFS benchmark results

Posted by Doug Cutting <cu...@nutch.org>.
Jay Pound wrote:
> The bad news: when doing a get, it will use 100% of the CPU to pull down
> data at 100 Mbit on a gigabit machine. Perhaps some code in
> org.apache.nutch.fs.TestClient could be cleaned up to make this faster, or
> it could open multiple threads for receiving data to distribute the load
> across all CPUs in the system.

I am much more interested in optimizing aggregate performance (rate at 
which all nodes can read files) than optimizing single-node performance.

> Now, I was able to see a performance increase per machine while running
> multiple datanodes on each box; by this I mean more network throughput per
> box. So Doug, if your 400 GB drives aren't in a RAID setup and you run 4
> datanodes per box, you will see higher throughput per box for datanode
> traffic. Doug, I know you're already looking at the namenode to see how to
> speed things up; may I request two things for NDFS that are going to be
> needed?

Multiple datanodes are automatically started if you have a 
comma-separated list of data directories in your config file.  One 
datanode thread is launched per data directory.  These are assumed to be 
on separate devices.
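
For example, something like this in nutch-site.xml would start two datanode
threads, one per device (the property name here is from memory; check
nutch-default.xml in your tree for the exact name):

<property>
  <name>ndfs.data.dir</name>
  <value>/mnt/disk1/ndfs/data,/mnt/disk2/ndfs/data</value>
</property>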

> 1.) namenode: please thread out the different sections of the code. Make
> replication a single thread, with put and get as separate threads as well;
> this should speed things up when working in a large cluster, and maybe also
> lower the time it takes to respond to putting chunks on the machines. It
> seems like it queues the put requests for each datanode; maybe run get and
> put requests in parallel instead of waiting for a response from the
> datanode being requested? If I'm wrong on any of this, sorry; I'm not a
> programmer and I don't know how to read the Nutch code to see whether this
> is true or not, otherwise I would know the answer to these questions.

No file data ever flows through the namenode.  All replication and file 
access is already in separate threads.

> 2.) datanode: please, please, please put the data into sub-directories the
> way squid does. I really do not want a single directory with a million
> files/chunks in it. Reiser will do OK with it, but I'm running multiple
> terabytes per datanode in a single logical drive configuration, and I don't
> want to run the filesystem to its limit, crash, and lose all my data
> because the machine won't boot (I have experience in this area,
> unfortunately).

This is a known problem.  Please file a bug report.  Better would be to 
submit a patch that fixes it, or hire someone to do so.

> 3.) Excellent job on making it much more stable; it looks very close to
> usable now!

Thanks!

> PS: Doug, I would like to talk with you sometime about this if you have an
> opportunity.

Please use the mailing lists.  I am fully booked and unavailable for 
hire at present.

Doug

Re: luke??

Posted by Jay Pound <we...@poundwebhosting.com>.
I got it to work now; it wasn't selecting the directory I had chosen, so I
typed it in and it works fine.
BTW very cool tool
-J
----- Original Message ----- 
From: "Fredrik Andersson" <fi...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Sunday, August 07, 2005 6:16 PM
Subject: Re: luke??


> That's odd, Luke is working great both on Debian and Windows here.
> Have you validated the index, i.e. no funny errors when running
> 'bin/nutch index'? Does the index directory contain all the necessary
> files (.fdt, .fdx, .frq, etc., depending on what fields you've chosen to
> index)? Try using the Searcher class to make a "manual" search to see
> if your index is flawed in some way; that's a good way to start.
>
> Fredrik
>
> On 8/7/05, Jay Pound <we...@poundwebhosting.com> wrote:
> > I tell Luke to look at my index directory for one segment, and it then
> > tells me it's not a Lucene index; I point directly to
> > l:/segments/2005xxx/index/. Does it work properly in Windows? Very cool
> > tool anyway, check it out for those who haven't; I found it on Andrzej's
> > website http://www.getopt.org, towards the bottom.
> > -J
> >
> >
> >
>
>



Re: luke??

Posted by Fredrik Andersson <fi...@gmail.com>.
That's odd, Luke is working great both on Debian and Windows here.
Have you validated the index, i.e. no funny errors when running
'bin/nutch index'? Does the index directory contain all the necessary
files (.fdt, .fdx, .frq, etc., depending on what fields you've chosen to
index)? Try using the Searcher class to make a "manual" search to see
if your index is flawed in some way; that's a good way to start.
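
For instance, something along these lines (a rough, untested sketch against
the Lucene 1.4-era API; the "content" and "url" field names are just guesses
at what the Nutch index contains) will show whether the index opens and
returns hits at all:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class IndexCheck {
  public static void main(String[] args) throws Exception {
    String dir = args[0];                      // e.g. l:/segments/2005xxx/index
    System.out.println("indexExists: " + IndexReader.indexExists(dir));

    IndexSearcher searcher = new IndexSearcher(dir);
    Query q = QueryParser.parse(args.length > 1 ? args[1] : "test",
                                "content", new StandardAnalyzer());
    Hits hits = searcher.search(q);
    System.out.println(hits.length() + " hits");
    for (int i = 0; i < Math.min(5, hits.length()); i++) {
      Document doc = hits.doc(i);
      System.out.println(doc.get("url"));      // stored field name is a guess
    }
    searcher.close();
  }
}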

Fredrik

On 8/7/05, Jay Pound <we...@poundwebhosting.com> wrote:
> I tell Luke to look at my index directory for one segment, and it then
> tells me it's not a Lucene index; I point directly to
> l:/segments/2005xxx/index/. Does it work properly in Windows? Very cool
> tool anyway, check it out for those who haven't; I found it on Andrzej's
> website http://www.getopt.org, towards the bottom.
> -J
> 
> 
>

Re: luke??

Posted by Jay Pound <we...@poundwebhosting.com>.
thanks
-J
----- Original Message ----- 
From: "EM" <em...@cpuedge.com>
To: <nu...@lucene.apache.org>
Sent: Sunday, August 07, 2005 4:29 PM
Subject: RE: luke??


> I've just downloaded and tried it, and it works for me; try entering the
> directory without the 'index' part.
>
> -----Original Message-----
> From: Jay Pound [mailto:webmaster@poundwebhosting.com]
> Sent: Sunday, August 07, 2005 4:20 PM
> To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
> Subject: luke??
>
> I tell Luke to look at my index directory for one segment, and it then
> tells me it's not a Lucene index; I point directly to
> l:/segments/2005xxx/index/. Does it work properly in Windows? Very cool
> tool anyway, check it out for those who haven't; I found it on Andrzej's
> website http://www.getopt.org, towards the bottom.
> -J
>
>
>
>
>



RE: luke??

Posted by EM <em...@cpuedge.com>.
I've just downloaded and tried it, and it works for me; try entering the
directory without the 'index' part.

-----Original Message-----
From: Jay Pound [mailto:webmaster@poundwebhosting.com] 
Sent: Sunday, August 07, 2005 4:20 PM
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: luke??

I tell Luke to look at my index directory for one segment, and it then tells
me it's not a Lucene index; I point directly to l:/segments/2005xxx/index/.
Does it work properly in Windows? Very cool tool anyway, check it out for
those who haven't; I found it on Andrzej's website http://www.getopt.org,
towards the bottom.
-J





luke??

Posted by Jay Pound <we...@poundwebhosting.com>.
I tell Luke to look at my index directory for one segment, and it then tells
me it's not a Lucene index; I point directly to l:/segments/2005xxx/index/.
Does it work properly in Windows? Very cool tool anyway, check it out for
those who haven't; I found it on Andrzej's website http://www.getopt.org,
towards the bottom.
-J



Re: ndfs problem needs fix

Posted by Jay Pound <we...@poundwebhosting.com>.
#2 from your response:

> I'm not yet sure how disk failures appear to a JVM.  Things are currently
> written so that if an exception is thrown during disk i/o then the datanode
> should take itself offline, initiating replication of its data.  We'll see
> if that's sufficient.
The data is replicated, but the problem is that the client stalls, still
trying to write the data to that node and waiting for a response
indefinitely. The same thing happens when the device runs out of HD space:
the client hangs trying to write the last block to the disk, which it cannot
write because there is less than 32 MB available. I think there should be a
limit of something like 100 MB left available on the disk, so the machine
doesn't become unusable with so little free space.
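
To illustrate the kind of check I mean (purely a sketch, not the actual
datanode code; it shells out to "df -k", so Unix only, and the 100 MB
reserve is just the number from above):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class FreeSpaceCheck {
  static final long RESERVE_BYTES = 100L * 1024 * 1024;   // keep ~100 MB free

  /** Free bytes on the filesystem holding 'path', parsed from "df -k". */
  static long freeBytes(String path) throws IOException {
    Process p = Runtime.getRuntime().exec(new String[] {"df", "-k", path});
    BufferedReader in =
      new BufferedReader(new InputStreamReader(p.getInputStream()));
    in.readLine();                                  // skip the header line
    String[] cols = in.readLine().trim().split("\\s+");
    return Long.parseLong(cols[3]) * 1024L;         // "Available" column, KB
  }

  public static void main(String[] args) throws IOException {
    String dataDir = args.length > 0 ? args[0] : ".";
    if (freeBytes(dataDir) < RESERVE_BYTES) {
      System.out.println("refusing new blocks: under 100 MB free on " + dataDir);
    } else {
      System.out.println("ok to accept blocks on " + dataDir);
    }
  }
}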
Thanks,
-J
----- Original Message ----- 
From: "Doug Cutting" <cu...@nutch.org>
To: <nu...@lucene.apache.org>
Sent: Monday, August 08, 2005 4:13 PM
Subject: Re: ndfs problem needs fix


> Jay Pound wrote:
> > 1.) We need to split chunks of data into sub-folders so as not to run the
> > filesystem up against its physical limit on files in a single directory,
> > the way squid splits up its data into directories.
>
> I agree.  I am currently using reiser with NDFS so this is not a
> priority, but long-term it should be fixed.  Please file a bug report,
> and, ideally, contribute a patch.
>
> > 2.) When a datanode is set to store data on an NFS share / Samba share
> > [...]
>
> That is not a recommended configuration.
>
> A datanode should reasonably handle disk failures.  Developing and
> debugging this may take time, however.  I'm not yet sure how disk
> failures appear to a JVM.  Things are currently written so that if an
> exception is thrown during disk i/o then the datanode should take itself
> offline, initiating replication of its data.  We'll see if that's
> sufficient.
>
> > 3.) We need to set a limit on how much of the filesystem can be used by
> > NDFS, or a max number of 32 MB chunks to be stored; when a single machine
> > runs out of space, the same thing happens as in #2: NDFS hangs waiting to
> > write data to that particular datanode instead of transmitting data to
> > the other datanodes.
>
> The max storage per datanode was configurable, but we found that to be
> difficult, as it required separate configuration per datanode if
> datanodes have different devices.  So now all space on the device is
> assumed to be available to NDFS.  Probably making this optionally
> configurable would be better.  Please file a bug report, and, ideally,
> contribute a patch.
>
> Doug



Re: ndfs problem needs fix

Posted by Doug Cutting <cu...@nutch.org>.
Jay Pound wrote:
> 1.) We need to split chunks of data into sub-folders so as not to run the
> filesystem up against its physical limit on files in a single directory,
> the way squid splits up its data into directories.

I agree.  I am currently using reiser with NDFS so this is not a 
priority, but long-term it should be fixed.  Please file a bug report, 
and, ideally, contribute a patch.

> 2.) When a datanode is set to store data on an NFS share / Samba share [...]

That is not a recommended configuration.

A datanode should reasonably handle disk failures.  Developing and 
debugging this may take time, however.  I'm not yet sure how disk 
failures appear to a JVM.  Things are currently written so that if an 
exception is thrown during disk i/o then the datanode should take itself 
offline, initiating replication of its data.  We'll see if that's 
sufficient.
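
Conceptually, the policy is just this (a toy sketch with made-up names, not
the actual datanode code):

import java.io.IOException;

public class DiskFailurePolicy {

  /** Minimal stand-in for the real datanode; these names are hypothetical. */
  interface Datanode {
    void writeBlock(long blockId, byte[] data) throws IOException;
    void goOffline();   // namenode notices and re-replicates this node's blocks
  }

  /** If disk I/O throws, take the whole node offline rather than limp along. */
  static void store(Datanode node, long blockId, byte[] data) {
    try {
      node.writeBlock(blockId, data);
    } catch (IOException e) {
      node.goOffline();
    }
  }
}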

> 3.) We need to set a limit on how much of the filesystem can be used by
> NDFS, or a max number of 32 MB chunks to be stored; when a single machine
> runs out of space, the same thing happens as in #2: NDFS hangs waiting to
> write data to that particular datanode instead of transmitting data to the
> other datanodes.

The max storage per datanode was configurable, but we found that to be 
difficult, as it required separate configuration per datanode if 
datanodes have different devices.  So now all space on the device is 
assumed to be available to NDFS.  Probably making this optionally 
configurable would be better.  Please file a bug report, and, ideally, 
contribute a patch.

Doug

Re: ndfs problem needs fix

Posted by Jay Pound <we...@poundwebhosting.com>.
OK, here it is short and sweet; I've found some problems that need fixing
with NDFS:

1.) We need to split chunks of data into sub-folders so as not to run the
filesystem up against its physical limit on files in a single directory, the
way squid splits up its data into directories.

2.) When a datanode is set to store data on an NFS share / Samba share (via
conf) and the connection is severed, the whole NDFS filesystem hangs until
data can be written to that one drive; when the drive map is re-connected, it
goes really fast for a few seconds to catch up (50 MB a sec for about 15
secs). This will also be a problem when an HD fails in a system: the datanode
will still function, but the drive will not be able to send or receive data
because it's dead, and NDFS will hang.

3.) We need to set a limit on how much of the filesystem can be used by
NDFS, or a max number of 32 MB chunks to be stored; when a single machine
runs out of space, the same thing happens as in #2: NDFS hangs waiting to
write data to that particular datanode instead of transmitting data to the
other datanodes.

Also, I've found it's much more stable now; I haven't had any crashes when
the conditions are ideal for the way NDFS works now!

Sorry about the big e-mails; my brain goes much faster than my fingers!
-J


----- Original Message ----- 
From: "Andrzej Bialecki" <ab...@getopt.org>
To: <we...@poundwebhosting.com>
Sent: Sunday, August 07, 2005 3:00 PM
Subject: Re: ndfs problem needs fix


> Jay Pound wrote:
>
> [.....................................]
>
> Jay,
>
> This is nothing personal, but I tend to skip your messages, because they
> are so badly formatted that it just hurts my eyes, and I don't have the
> time to parse paragraphs, which occupy half a page... Please try to be
> more concise and divide your messages into shorter paragraphs.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



ndfs problem needs fix

Posted by Jay Pound <we...@poundwebhosting.com>.
I'm copying data into NDFS right now. I've had the server crash (bad memory
timings, oops); it was running 2 datanodes and the namenode. It recovered
from a flat-out crash perfectly (blue-screen kernel error, system beeping;
Windows 2003 64 sucks): I started the datanodes first, then the namenode, and
it replicated data and continued writing the file. Perfect.

But here is something people are going to run into big-time. On one of my
machines I'm running Nutch from a network share; the network share went down,
and here is what happens: the network stops sending data, and it waits,
saying "could not complete file, retrying". Once re-connected, it has all the
data it had to send and receive while it was down and tries to catch up. The
bad news is that had I not re-connected to the server's share drive, the
system would have hung waiting until I did, without copying the rest of the
300 GB of data I'm copying into it at 5-22 MB a sec.

I'll let you know in the morning if it completes and whether it is able to
recover fully from this; data is transferring, but it still says "could not
complete file, retrying" every 1/2-1/3 second.

Good news: data rates are up to 10 MB/s on some machines, minimum 4 MB/s.

Just in case someone has a single machine connected to a NAS or iSCSI or NFS
or even crappy Samba and experiences a network outage.
-Jay



NDFS benchmark results

Posted by Jay Pound <we...@poundwebhosting.com>.
OK, here it is:

I was seeing the same thing Doug was seeing when copying data in and out of
NDFS: 5 MB/s I/O, which reminds me of the good old days when there were only
100 Mbit half-duplex connections. I'm running 3 machines at 1000 Mbit and 2
at 100 Mbit.

Now, here is where I'm able to see throughput in the 500-600 Mbit range:
while copying data to the DFS, if I shut down a node it will replicate data
at the same time as transferring data into the DFS, peaking around 53 MB a
sec. I'm only working with a 1.8 GB file this time.

The bad news: when doing a get, it will use 100% of the CPU to pull down data
at 100 Mbit on a gigabit machine. Perhaps some code in
org.apache.nutch.fs.TestClient could be cleaned up to make this faster, or it
could open multiple threads for receiving data to distribute the load across
all CPUs in the system.

Now, I was able to see a performance increase per machine while running
multiple datanodes on each box; by this I mean more network throughput per
box. So Doug, if your 400 GB drives aren't in a RAID setup and you run 4
datanodes per box, you will see higher throughput per box for datanode
traffic. Doug, I know you're already looking at the namenode to see how to
speed things up; may I request two things for NDFS that are going to be
needed?
1.) namenode: please thread out the different sections of the code. Make
replication a single thread, with put and get as separate threads as well;
this should speed things up when working in a large cluster, and maybe also
lower the time it takes to respond to putting chunks on the machines. It
seems like it queues the put requests for each datanode; maybe run get and
put requests in parallel instead of waiting for a response from the datanode
being requested? If I'm wrong on any of this, sorry; I'm not a programmer and
I don't know how to read the Nutch code to see whether this is true or not,
otherwise I would know the answer to these questions.
2.) datanode: please, please, please put the data into sub-directories the
way squid does (a rough sketch of the kind of layout I mean follows below). I
really do not want a single directory with a million files/chunks in it.
Reiser will do OK with it, but I'm running multiple terabytes per datanode in
a single logical drive configuration, and I don't want to run the filesystem
to its limit, crash, and lose all my data because the machine won't boot (I
have experience in this area, unfortunately).
3.) Excellent job on making it much more stable; it looks very close to
usable now!
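
For illustration, the two-level layout I mean would look something like this
(a made-up sketch; the directory scheme and the "blk_" naming are
hypothetical, not what NDFS actually does):

import java.io.File;

public class BlockPath {

  /** Map a non-negative block id to <dataDir>/<d1>/<d2>/blk_<id>,
      squid-style, so no single directory ever holds millions of chunks. */
  static File pathForBlock(File dataDir, long blockId) {
    long d1 = blockId % 256;            // first level: up to 256 dirs
    long d2 = (blockId / 256) % 256;    // second level: up to 256 dirs each
    File dir = new File(dataDir, d1 + File.separator + d2);
    dir.mkdirs();                       // create the two levels if needed
    return new File(dir, "blk_" + blockId);
  }

  public static void main(String[] args) {
    // prints something like ndfs-data/21/205/blk_123456789
    System.out.println(pathForBlock(new File("ndfs-data"), 123456789L));
  }
}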
-Jay Pound
PS: Doug, I would like to talk with you sometime about this if you have an
opportunity.
PPS: Here is a snippet of the -report, in case you're interested:




Administrator@desk /nutch-ndfs
$ ./bin/nutch org.apache.nutch.fs.TestClient -report
050806 182802 parsing file:/C:/cygwin/nutch-ndfs/conf/nutch-default.xml
050806 182802 parsing file:/C:/cygwin/nutch-ndfs/conf/nutch-site.xml
050806 182803 No FS indicated, using default:OPTERON:9000
050806 182803 Client connection to 10.0.0.101:9000: starting
Total raw bytes: 2449488044032 (2281.26 Gb)
Used raw bytes: 1194891779358 (1112.82 Gb)
% used: 48.78%

Total effective bytes: 4014804878 (3.73 Gb)
Effective replication multiplier: 297.6213827739601
-------------------------------------------------
Datanodes available: 10

Name: CPQ19312594631:7000
Total raw bytes: 39999500288 (37.25 Gb)
Used raw bytes: 14021105091 (13.05 Gb)
% used: 35.05%
Last contact with namenode: Sat Aug 06 18:29:01 EDT 2005

Name: desk:7000
Total raw bytes: 74027487232 (68.94 Gb)
Used raw bytes: 58792909619 (54.75 Gb)
% used: 79.42%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: desk:7001
Total raw bytes: 320070287360 (298.08 Gb)
Used raw bytes: 287845425725 (268.07 Gb)
% used: 89.93%
Last contact with namenode: Sat Aug 06 18:29:01 EDT 2005

Name: desk:7002
Total raw bytes: 250048479232 (232.87 Gb)
Used raw bytes: 248354007613 (231.29 Gb)
% used: 99.32%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: desk:7003
Total raw bytes: 200047001600 (186.30 Gb)
Used raw bytes: 196543472435 (183.04 Gb)
% used: 98.24%
Last contact with namenode: Sat Aug 06 18:29:00 EDT 2005

Name: desk:7004
Total raw bytes: 200047001600 (186.30 Gb)
Used raw bytes: 190989330432 (177.87 Gb)
% used: 95.47%
Last contact with namenode: Sat Aug 06 18:28:59 EDT 2005

Name: desk:7005
Total raw bytes: 200038776832 (186.30 Gb)
Used raw bytes: 81084996239 (75.51 Gb)
% used: 40.53%
Last contact with namenode: Sat Aug 06 18:28:59 EDT 2005

Name: michael-05699cn:7000
Total raw bytes: 160031014912 (149.04 Gb)
Used raw bytes: 46235792507 (43.06 Gb)
% used: 28.89%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: opteron:7000
Total raw bytes: 959914815488 (893.99 Gb)
Used raw bytes: 29605043569 (27.57 Gb)
% used: 3.08%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

Name: quadzilla:7000
Total raw bytes: 45263679488 (42.15 Gb)
Used raw bytes: 41419696128 (38.57 Gb)
% used: 91.50%
Last contact with namenode: Sat Aug 06 18:29:02 EDT 2005

I love cluster filesystems! How cool is that?



mapred question

Posted by Jay Pound <we...@poundwebhosting.com>.
How would I set up mapred for SMP machines? I understand it will split up big
jobs like indexing or updating the db into a bunch of chunks to be processed
by separate machines. I have multiple-processor machines that I want to test
this with internally; it makes sense to utilize the full potential of SMP
machines, and this is a great idea that I'm very glad is being implemented.
Currently on these machines I index 4 segments at the same time, but the
update db can only be done one segment at a time, so it would be great to
speed that process up. Will mapreduce work with the updatedb and also the
generatedb? Not that generatedb is bad to wait for, but it will be when it
contains billions of links!
-J
PS: I'm planning on running another benchmark for NDFS; I'll make a site and
post data/screenshots with the results.


