Posted to user@nutch.apache.org by Paul van Brouwershaven <pa...@vanbrouwershaven.com> on 2005/09/29 20:22:37 UTC

MapReduce

Hello,

What am I doing wrong?

I have set up two systems to run a MapReduce Nutch crawler, but the jobs 
always run on the same host ("localhost") when I look at http://server:7845

seeds/urls is a URL list with 500,000 URLs

# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds

# crawl a bit
bin/nutch crawl seeds -depth 10

I have tried mapred.map.tasks at 2, 4 and 8,
and mapred.reduce.tasks at 1, 2 and 3.

But I always get the same result:

Name	Host	# running tasks	Secs since heartbeat
tracker_29414	srv34	0	1
tracker_36968	srv21	1	2

srv21 is the master

Thanks,

Paul







Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Doug Cutting wrote:
>> The AcceptEnv option is only available with ssh 3.9 and later. Debian 
>> currently only has 3.8.1p1 in stable and testing (4.2 in unstable).
>>
>> Is there another way to solve the env. problem?
> 
> 
> I don't know.  The Fedora and Debian systems that I use have AcceptEnv.

I think you are using the unstable branch of Debian (Version 4.2). (Fedora 
has version 3.9)

I have compiled ssh 4.2 by hand and that solved the AcceptEnv problem.
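
For reference, the build was just the standard OpenSSH portable build (the 
install prefix below is only an example; back up your existing sshd setup 
first):

   # grab the portable openssh-4.2p1 tarball from an OpenSSH mirror, then:
   tar xzf openssh-4.2p1.tar.gz
   cd openssh-4.2p1
   ./configure --prefix=/usr/local --sysconfdir=/etc/ssh
   make
   make install        # as root; restart sshd afterwards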

But when I add one server to my slave list I get a Java error when 
uploading the files to the NDFS.

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I think end-to-end testing should focus on end-to-end
> problems (e.g. PDF parsing is already checked by unit
> tests, and that is really the right place for it).

Hate to say it, but today was the first time I got ant
test to work (hadn't tried too hard), and yeah, I saw
several such tests.

> What about creating a trunk/qa "module" ?

Well, I am most concerned with the mapred branch, and
was specifically wondering about where to put content,
like html content.

Earl


	
		


Re: a simple map reduce tutorial

Posted by Jérôme Charron <je...@gmail.com>.
> > I think it would be better to have the junit tests
> > start jetty then
> > crawl localhost. I'd love to see some end-to-end
> > unit tests like that.

+1

> I think this would also make it nice to test things
> like recursive linking, parsing pdfs or other file
> formats, observing robots.txt or any crawling bugs
> that are encountered and then fixed.

I think end-to-end testing should focus on end-to-end problems (e.g. PDF
parsing is already checked by unit tests, and that is really the right place
for it).
It would be better to perform some end-to-end (functional) tests to check,
for example (not exhaustive):
* that, depending on the configuration, the right documents are fetched and
correctly parsed (as you suggested);
* some limit cases: protocol errors, corrupted content, ...;
* fetching/crawling/indexing performance under many configurations (a rough
sketch follows below);
* searching performance under various query loads and database sizes.

Those are just some ideas...
But it would be very cool if you could work on this subject.
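
As a very rough sketch of the "performance under many configurations" point
(the conf directory names are made up for illustration; the tasktrackers have
to be restarted so each variant takes effect):

   # time the same crawl under several mapred-default.xml variants
   for conf in conf-2maps conf-4maps conf-8maps; do
     cp $conf/mapred-default.xml conf/mapred-default.xml
     bin/nutch-daemons.sh stop tasktracker
     bin/nutch-daemons.sh start tasktracker
     ( time bin/nutch crawl seeds -depth 3 ) 2> results-$conf.txt
   done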

> Suggestions for where to put such test content in the
> tree?

What about creating a trunk/qa "module" ?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> Suggestions for where to put such test content in the
> tree?

For now, just put them with the test code.

Doug

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I'm not sure what you mean.  I set environment
> variables in my .bashrc, 
> then simply use 'bin/start-all.sh' and 'bin/nutch
> crawl'.

Well, not sure if you looked at my tutorial, which is
now on the wiki

http://wiki.apache.org/nutch/SimpleMapReduceTutorial

but yeah, that is much simpler than what I am doing. 
Looks like a little example has been added to the FAQ,
which wasn't there last time I looked.

> NutchBean now looks for things in the subdirectory
> of the connected 
> directory named 'crawl'.  Is that an improvement or
> is it just confusing?

I think magic is ok so long as it is documented and it
works.

> I think it would be better to have the junit tests
> start jetty then 
> crawl localhost.  I'd love to see some end-to-end
> unit tests like that.

Think I will start to work on this.  Maybe start with
one page that contains just a few phrases, or maybe
just the word nutch, then make sure it can be queried
out in the end?  Could also check status through the
process to make sure everything looks good.  If
nothing else, I would likely understand the process
pretty well by the time I got done with my writing.
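
Roughly the outline I have in mind (the test page directory, port and phrase
are placeholders, not existing Nutch scripts; any small local web server
would do):

   # serve a small directory of known test pages on localhost
   (cd testpages && python -m SimpleHTTPServer 8080) &
   # crawl only that server
   echo http://localhost:8080/ > seeds/urls
   bin/nutch crawl seeds -depth 3
   # a junit test would then query the resulting index for a phrase that is
   # known to be on the test pages (e.g. "nutch") and fail if it is missing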

I think this would also make it nice to test things
like recursive linking, parsing pdfs or other file
formats, observing robots.txt or any crawling bugs
that are encountered and then fixed.

Suggestions for where to put such test content in the
tree?

> You should be able to add them to the wiki yourself.

Thanks, I added them.

Earl


	
		


Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> 1.  Sounds like some of you have some glue programs
> that help run the whole process.  Are these going to
> end up in subversion sometime?  I am guessing there is
> much duplicated effort.

I'm not sure what you mean.  I set environment variables in my .bashrc, 
then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.

> 2.  Not sure how to test that my index actually
> worked.  Starting catalina in my index directory
> didn't work this time.

NutchBean now looks for things in the subdirectory of the connected 
directory named 'crawl'.  Is that an improvement or is it just confusing?

> 3.  What do you all think of setting up some test
> directories to crawl, in say 
> 
> http://lucene.apache.org/nutch/test/
> 
> Thinking it would be kind of cool to have junit run
> through a whole process on external pages.

I think it would be better to have the junit tests start jetty then 
crawl localhost.  I'd love to see some end-to-end unit tests like that.

> 4.  Any way that
> 
> http://spack.net/nutch/SimpleMapReduceTutorial.html
> http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
> 
> can get on the wiki?  I am using apache-ish style and
> would change to whatever, but as fun as these are to
> write, I would like to see them used.  

You should be able to add them to the wiki yourself.  Just fill out:

http://wiki.apache.org/nutch/UserPreferences

Thanks,

Doug


a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
Looks like I made it through a cycle on the map reduce
branch.  I put my steps here

http://spack.net/nutch/SimpleMapReduceTutorial.html

A few questions.

1.  Sounds like some of you have some glue programs
that help run the whole process.  Are these going to
end up in subversion sometime?  I am guessing there is
much duplicated effort.

2.  Not sure how to test that my index actually
worked.  Starting catalina in my index directory
didn't work this time.

3.  What do you all think of setting up some test
directories to crawl, in say 

http://lucene.apache.org/nutch/test/

Thinking it would be kind of cool to have junit run
through a whole process on external pages.

4.  Any way that

http://spack.net/nutch/SimpleMapReduceTutorial.html
http://spack.net/nutch/GettingNutchRunningOnUbuntu.html

can get on the wiki?  I am using apache-ish style and
would change to whatever, but as fun as these are to
write, I would like to see them used.  

Feedback would be appreciated.

Enjoy!

Earl


		

Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
> I have the same problem and the only thing I found was rsh. :-/
> For my servers I will do the following:
> have a simple daemon that executes all scripts in a folder and deletes
> each file after executing it,
> and a script that just copies the command as a text file to
> all my slaves, where the daemon will execute it.
> This is rather crude but will work with older versions of ssh as well.

I do the same here, but I don't like this setup. :)

Re: MapReduce

Posted by Stefan Groschupf <sg...@media-style.com>.
>> The AcceptEnv option is only available with ssh 3.9 and later. Debian  
>> currently only has 3.8.1p1 in stable and testing (4.2 in unstable).
>> Is there another way to solve the env. problem?
>>
>
> I don't know.  The Fedora and Debian systems that I use have  
> AcceptEnv.

I have the same problem and the only thing I found was rsh. :-/
For my servers I will do the following:
have a simple daemon that executes all scripts in a folder and deletes
each file after executing it,
and a script that just copies the command as a text file to
all my slaves, where the daemon will execute it.
This is rather crude but will work with older versions of ssh as well.
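
Roughly, the two pieces could look like this (the spool directory and poll
interval are just examples):

   # on each slave: a trivial "daemon" that runs and then removes queued scripts
   while true; do
     for f in /var/spool/nutch-cmds/*.sh; do
       [ -e "$f" ] || continue
       sh "$f"
       rm -f "$f"
     done
     sleep 5
   done

   # on the master: copy a command file to every slave in ~/.slaves
   echo "$*" > /tmp/cmd.$$.sh
   for slave in `cat ~/.slaves`; do
     scp /tmp/cmd.$$.sh $slave:/var/spool/nutch-cmds/
   done
   rm -f /tmp/cmd.$$.sh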

Stefan



Re: MapReduce

Posted by Doug Cutting <cu...@nutch.org>.
Paul van Brouwershaven wrote:
> The AcceptEnv option is only available with ssh 3.9 and later. Debian 
> currently only has 3.8.1p1 in stable and testing (4.2 in unstable).
> 
> Is there another way to solve the env. problem?

I don't know.  The Fedora and Debian systems that I use have AcceptEnv.

Doug

Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Paul van Brouwershaven wrote:
>> Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has 
>> 'AcceptEnv yes'.  It's also useful, if your hosts don't share the nutch 
>> installation via NFS, to set NUTCH_MASTER with something like:
> 
> AcceptEnv yes or AcceptEnv * is not working (Bad configuration option: 
> AcceptEnv), but I have set PermitUserEnvironment yes now.

The AcceptEnv option is only available with ssh 3.9 and later. Debian currently 
only has 3.8.1p1 in stable and testing (4.2 in unstable).

Is there another way to solve the env. problem?

Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Doug Cutting wrote:
> Where are you setting mapred.map.tasks?  It should be set in 
> mapred-default.xml used when starting the tasktracker.  You must restart 
> the task trackers for this setting to take effect:
The settings are done in mapred-default.xml

> bin/nutch-daemons.sh stop tasktracker
> bin/nutch-daemons.sh start tasktracker
I have tried this, but I think that stop-all.sh & start-all.sh do this as well.

> (I assume your ~/.slaves file contains srv34 and srv21.)
Yes, correct

> Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has 
> 'AcceptEnv yes'.  It's also useful, if your hosts don't share the nutch 
> installation via NFS, to set NUTCH_MASTER with something like:
AcceptEnv yes or AcceptEnv * is not working (Bad configuration option: 
AcceptEnv), but I have set PermitUserEnvironment yes now.
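
For reference, the PermitUserEnvironment route roughly looks like this (file
locations are the standard OpenSSH ones; the NUTCH_MASTER value is just the
example from below):

   # on each slave, in /etc/ssh/sshd_config:
   #   PermitUserEnvironment yes
   # then put the variables sshd should pass to non-interactive logins into
   # ~/.ssh/environment on that slave, e.g.:
   echo "NUTCH_MASTER=srv21:/home/USER/src/nutch" >> ~/.ssh/environment
   /etc/init.d/ssh restart    # restart sshd so the config change takes effect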

> export NUTCH_MASTER=srv21:/home/USER/src/nutch
Good option; let's try to get it working first. :)

> I'm working on documenting this stuff better...
It would be nice if there were some good documentation, but we will find the 
problem.

I start the crawler with "bin/nutch crawl seeds -depth 10"; is this OK?

OK, now the crawler starts but gets stuck after 25%; I think this is a 
separate problem. But OK, it also runs only on the primary server.

The firewall is fully open, so that can't be the problem.

On the master (srv21 - conf/nutch-site.xml):
   <name>fs.default.name</name>
   <value>localhost:9009</value>

   <name>mapred.job.tracker</name>
   <value>localhost:9010</value>

On the master (srv21 - /root/.slaves)
   srv21
   srv34

On the slave (srv34 - conf/nutch-site.xml):
   <name>fs.default.name</name>
   <value>srv21:9009</value>

   <name>mapred.job.tracker</name>
   <value>srv21:9010</value>

On the master & slave (srv21/srv34 - conf/mapred-default.xml)
   <name>mapred.map.tasks</name>
   <value>4</value>

   <name>mapred.reduce.tasks</name>
   <value>2</value>

Is there a way to tell it: crawl the URLs I submit to the server, and once 
you have done all the URLs, update them, but not more than once every x 
days?


Re: MapReduce

Posted by Doug Cutting <cu...@nutch.org>.
Paul van Brouwershaven wrote:
> I have tried mapred.map.tasks at 2, 4 and 8,
> and mapred.reduce.tasks at 1, 2 and 3.
> 
> But I always get the same result:
> 
> Name    Host    # running tasks    Secs since heartbeat
> tracker_29414    srv34    0    1
> tracker_36968    srv21    1    2

Where are you setting mapred.map.tasks?  It should be set in 
mapred-default.xml used when starting the tasktracker.  You must restart 
the task trackers for this setting to take effect:

bin/nutch-daemons.sh stop tasktracker
bin/nutch-daemons.sh start tasktracker

(I assume your ~/.slaves file contains srv34 and srv21.)

Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has 
'AcceptEnv yes'.  It's also useful, if your hosts don't share the nutch 
installation via NFS, to set NUTCH_MASTER with something like:

export NUTCH_MASTER=srv21:/home/USER/src/nutch

This will cause daemons to sync the latest config and code from the 
master each time they're restarted.
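
(Conceptually the sync amounts to something like the following; this is only a
sketch of the idea, not the literal script contents, and NUTCH_HOME here just
stands for the local installation directory:)

   # pull the current code and config from the master before starting
   rsync -az --delete $NUTCH_MASTER/ $NUTCH_HOME/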

I'm working on documenting this stuff better...

Doug