Posted to user@nutch.apache.org by Paul van Brouwershaven <pa...@vanbrouwershaven.com> on 2005/09/29 20:22:37 UTC
MapReduce
Hello,
What am I doing wrong?
I have set up two systems to run a MapReduce Nutch crawler, but the jobs
always run on the same "localhost" when I look at http://server:7845.
seeds/urls is a URL list with 500,000 URLs.
# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds
# crawl a bit
bin/nutch crawl seeds -depth 10
I have tried mapred.map.tasks set to 2, 4 and 8,
and mapred.reduce.tasks set to 1, 2 and 3,
but I always get the same result:
Name           Host   # running tasks   Secs since heartbeat
tracker_29414  srv34  0                 1
tracker_36968  srv21  1                 2
srv21 is the master
Thanks,
Paul
Re: MapReduce
Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Doug Cutting wrote:
>> The AcceptEnv option is only available with ssh 3.9 and later. Debian
>> currently only has 3.8.1p1 in stable and testing (4.2 is in unstable).
>>
>> Is there another way to solve the environment problem?
>
>
> I don't know. The Fedora and Debian systems that I use have AcceptEnv.
I think you are using the unstable branch of Debian (version 4.2); Fedora
has version 3.9.
I have compiled ssh 4.2 by hand, and that solved the AcceptEnv problem.
But when I add one server to my slave list, I get a Java error when
uploading the files to the NDFS.
Re: a simple map reduce tutorial
Posted by Earl Cahill <ca...@yahoo.com>.
> I think end-to-end testing must focus on end-to-end
> problems (e.g. PDF parsing is already checked by unit
> tests, and that is really the right place for doing it).
Hate to say it, but today was the first time I got ant
test to work (hadn't tried too hard), and yeah, I saw
several such tests.
> What about creating a trunk/qa "module"?
Well, I am most concerned with the mapred branch, and
was specifically wondering about where to put content,
like html content.
Earl
______________________________________________________
Yahoo! for Good
Donate to the Hurricane Katrina relief effort.
http://store.yahoo.com/redcross-donate3/
Re: a simple map reduce tutorial
Posted by Jérôme Charron <je...@gmail.com>.
> > I think it would be better to have the junit tests
> > start jetty then
> > crawl localhost. I'd love to see some end-to-end
> > unit tests like that.
+1
> I think this would also make it nice to test things
> like recursive linking, parsing pdfs or other file
> formats, observing robots.txt or any crawling bugs
> that are encountered and then fixed.
I think end-to-end testing must focus on end-to-end problems (e.g. PDF
parsing is already checked by unit tests, and that is really the right
place for doing it).
It would be better to perform some end-to-end (functional) tests for
checking (not exhaustively):
* that, depending on the configuration, the right documents are fetched
and correctly parsed (as you suggested);
* some limit cases: protocol errors, corrupted content, ...
* some fetching/crawling/indexing performance tests with many configurations;
* some searching performance tests with varying query loads and database
sizes, ...
Those are just some ideas...
But it would be very cool if you could work on this subject.
> Suggestions for where to put such test content in the
> tree?
What about creating a trunk/qa "module"?
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: a simple map reduce tutorial
Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> Suggestions for where to put such test content in the
> tree?
For now, just put them with the test code.
Doug
Re: a simple map reduce tutorial
Posted by Earl Cahill <ca...@yahoo.com>.
> I'm not sure what you mean. I set environment
> variables in my .bashrc,
> then simply use 'bin/start-all.sh' and 'bin/nutch
> crawl'.
Well, not sure if you looked at my tutorial, which is
now on the wiki
http://wiki.apache.org/nutch/SimpleMapReduceTutorial
but yeah, that is much simpler than what I am doing.
Looks like a little example has been added to the FAQ,
which wasn't there last time I looked.
> NutchBean now looks for things in the subdirectory
> of the connected
> directory named 'crawl'. Is that an improvement or
> is it just confusing?
I think magic is ok so long as it is documented and it
works.
> I think it would be better to have the junit tests
> start jetty then
> crawl localhost. I'd love to see some end-to-end
> unit tests like that.
I think I will start to work on this. Maybe start with
one page that contains just a few phrases, or maybe
just the word nutch, then make sure it can be queried
out in the end? Could also check status through the
process to make sure everything looks good. If
nothing else, I would likely understand the process
pretty well by the time I got done with my writing.
I think this would also make it nice to test things
like recursive linking, parsing pdfs or other file
formats, observing robots.txt or any crawling bugs
that are encountered and then fixed.
Suggestions for where to put such test content in the
tree?
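The fixture idea above could start as small as this (a toy sketch; the
file names and temp-directory layout are made up, and a real test would
crawl the page rather than grep it):

```shell
# Create a throwaway fixture page containing just the word "nutch",
# then verify the word can be found in it again -- a stand-in for
# "queried out in the end" until a real crawl/query test exists.
FIXTURE_DIR=$(mktemp -d)
printf '<html><body>nutch</body></html>\n' > "$FIXTURE_DIR/index.html"
grep -c nutch "$FIXTURE_DIR/index.html"   # counts matching lines
```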
> You should be able to add them to the wiki yourself.
Thanks, I added them.
Earl
Re: a simple map reduce tutorial
Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> 1. Sounds like some of you have some glue programs
> that help run the whole process. Are these going to
> end up in subversion sometime? I am guessing there is
> much duplicated effort.
I'm not sure what you mean. I set environment variables in my .bashrc,
then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.
> 2. Not sure how to test that my index actually
> worked. Starting catalina in my index directory
> didn't work this time.
NutchBean now looks for things in the subdirectory of the connected
directory named 'crawl'. Is that an improvement or is it just confusing?
> 3. What do you all think of setting up some test
> directories to crawl, in say
>
> http://lucene.apache.org/nutch/test/
>
> Thinking it would be kind of cool to have junit run
> through a whole process on external pages.
I think it would be better to have the junit tests start jetty then
crawl localhost. I'd love to see some end-to-end unit tests like that.
> 4. Any way that
>
> http://spack.net/nutch/SimpleMapReduceTutorial.html
> http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
>
> can get on the wiki? I am using apache-ish style and
> would change to whatever, but as fun as these are to
> write, I would like to see them used.
You should be able to add them to the wiki yourself. Just fill out:
http://wiki.apache.org/nutch/UserPreferences
Thanks,
Doug
a simple map reduce tutorial
Posted by Earl Cahill <ca...@yahoo.com>.
Looks like I made it through a cycle on the map reduce
branch. I put my steps here
http://spack.net/nutch/SimpleMapReduceTutorial.html
A few questions.
1. Sounds like some of you have some glue programs
that help run the whole process. Are these going to
end up in subversion sometime? I am guessing there is
much duplicated effort.
2. Not sure how to test that my index actually
worked. Starting catalina in my index directory
didn't work this time.
3. What do you all think of setting up some test
directories to crawl, in say
http://lucene.apache.org/nutch/test/
Thinking it would be kind of cool to have junit run
through a whole process on external pages.
4. Any way that
http://spack.net/nutch/SimpleMapReduceTutorial.html
http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
can get on the wiki? I am using apache-ish style and
would change to whatever, but as fun as these are to
write, I would like to see them used.
Feedback would be appreciated.
Enjoy!
Earl
Re: MapReduce
Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
> I have the same problem, and the only thing I found was rsh. :-/
> For my servers I will do the following:
> have a simple daemon that executes all scripts in a folder and deletes
> the file after executing,
> and a script that just copies the command as a text file to all
> my slaves, where the daemon will execute it.
> This is very poor but will work with older versions of ssh as well.
I do the same here, but I don't like this setup. :)
Re: MapReduce
Posted by Stefan Groschupf <sg...@media-style.com>.
>> The AcceptEnv option is only available with ssh 3.9 and later. Debian
>> currently only has 3.8.1p1 in stable and testing (4.2 is in unstable).
>> Is there another way to solve the environment problem?
>>
>
> I don't know. The Fedora and Debian systems that I use have
> AcceptEnv.
I have the same problem, and the only thing I found was rsh. :-/
For my servers I will do the following:
have a simple daemon that executes all scripts in a folder and deletes
the file after executing,
and a script that just copies the command as a text file to all
my slaves, where the daemon will execute it.
This is very poor but will work with older versions of ssh as well.
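A minimal sketch of that workaround (entirely hypothetical names; a real
daemon would loop, and you would want to think about ordering and error
handling):

```shell
# Poor man's remote execution, as described above: run every script
# dropped into a spool directory, then delete it. SPOOL_DIR is
# illustrative; the copy-to-slaves script would write into it.
SPOOL_DIR="${SPOOL_DIR:-$HOME/nutch-commands}"
mkdir -p "$SPOOL_DIR"
run_pending() {
  for script in "$SPOOL_DIR"/*.sh; do
    [ -e "$script" ] || continue  # the glob matched nothing
    sh "$script"                  # execute the dropped command file
    rm -f "$script"               # delete the file after executing
  done
}
run_pending
# A real daemon would wrap this in: while true; do run_pending; sleep 5; done
```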
Stefan
Re: MapReduce
Posted by Doug Cutting <cu...@nutch.org>.
Paul van Brouwershaven wrote:
> The AcceptEnv option is only available with ssh 3.9 and later. Debian
> currently only has 3.8.1p1 in stable and testing (4.2 is in unstable).
>
> Is there another way to solve the environment problem?
I don't know. The Fedora and Debian systems that I use have AcceptEnv.
Doug
Re: MapReduce
Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Paul van Brouwershaven wrote:
>> Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has
>> 'AcceptEnv yes'. It's also useful, if your hosts don't share the nutch
>> installation via NFS, to set NUTCH_MASTER to something like:
>
> Neither AcceptEnv yes nor AcceptEnv * works (Bad configuration option:
> AcceptEnv), but I have set PermitUserEnvironment yes now.
The AcceptEnv option is only available with ssh 3.9 and later. Debian
currently only has 3.8.1p1 in stable and testing (4.2 is in unstable).
Is there another way to solve the environment problem?
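For reference, the two sshd_config options being discussed look like this
(a sketch; which variables you list after AcceptEnv is up to you, and
NUTCH_MASTER is taken from the thread):

```
# /etc/ssh/sshd_config
# OpenSSH 3.9 and later: let clients pass selected environment variables.
AcceptEnv NUTCH_MASTER
# Older releases: allow per-user ~/.ssh/environment files instead.
PermitUserEnvironment yes
```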
Re: MapReduce
Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Doug Cutting wrote:
> Where are you setting mapred.map.tasks? It should be set in
> mapred-default.xml used when starting the tasktracker. You must restart
> the task trackers for this setting to take effect:
The settings are in mapred-default.xml.
> bin/nutch-daemons.sh stop tasktracker
> bin/nutch-daemons.sh start tasktracker
I have tried this, but I think stop-all.sh & start-all.sh do this as well.
> (I assume your ~/.slaves file contains srv34 and srv21.)
Yes, correct
> Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has
> 'AcceptEnv yes'. It's also useful, if your hosts don't share the nutch
> installation via NFS, to set NUTCH_MASTER to something like:
Neither AcceptEnv yes nor AcceptEnv * works (Bad configuration option:
AcceptEnv), but I have set PermitUserEnvironment yes now.
> export NUTCH_MASTER=srv21:/home/USER/src/nutch
Good option, but let's try to get it working first :)
> I'm working on documenting this stuff better...
It would be nice if there were some good documentation, but we will find
the problem.
I start the crawler with "bin/nutch crawl seeds -depth 10"; is this OK?
OK, now the crawler starts but gets stuck after 25%. I think this is
another problem, but again it only runs on the primary server.
The firewall is fully open, so that can't be the problem.
On the master (srv21 - conf/nutch-site.xml):
<property>
  <name>fs.default.name</name>
  <value>localhost:9009</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9010</value>
</property>
On the master (srv21 - /root/.slaves):
srv21
srv34
On the slave (srv34 - conf/nutch-site.xml):
<property>
  <name>fs.default.name</name>
  <value>srv21:9009</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>srv21:9010</value>
</property>
On the master & slave (srv21/srv34 - conf/mapred-default.xml):
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
Is there a way to say: crawl the URLs I submit to the server, and update
them once all the URLs are done, but not more than once every x days?
Re: MapReduce
Posted by Doug Cutting <cu...@nutch.org>.
Paul van Brouwershaven wrote:
> I have tried mapred.map.tasks set to 2, 4 and 8,
> and mapred.reduce.tasks set to 1, 2 and 3,
> but I always get the same result:
>
> Name           Host   # running tasks   Secs since heartbeat
> tracker_29414  srv34  0                 1
> tracker_36968  srv21  1                 2
Where are you setting mapred.map.tasks? It should be set in
mapred-default.xml used when starting the tasktracker. You must restart
the task trackers for this setting to take effect:
bin/nutch-daemons.sh stop tasktracker
bin/nutch-daemons.sh start tasktracker
(I assume your ~/.slaves file contains srv34 and srv21.)
Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has
'AcceptEnv yes'. It's also useful, if your hosts don't share the nutch
installation via NFS, to set NUTCH_MASTER to something like:
export NUTCH_MASTER=srv21:/home/USER/src/nutch
This will cause daemons to sync the latest config and code from the
master each time they're restarted.
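Concretely, the non-NFS setup amounts to something like this in each
slave's ~/.bashrc (host and path copied from the example above; adjust
for your own layout):

```shell
# Tell the daemon scripts where the master's nutch tree lives, so a
# restart syncs config and code from there (per the description above).
export NUTCH_MASTER=srv21:/home/USER/src/nutch
# Then restarting the tasktrackers picks up the synced config:
#   bin/nutch-daemons.sh stop tasktracker
#   bin/nutch-daemons.sh start tasktracker
```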
I'm working on documenting this stuff better...
Doug