Posted to user@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/10/03 22:58:47 UTC

Re: MapReduce

Paul van Brouwershaven wrote:
> The AcceptEnv option is only available with ssh 3.9. Debian currently
> only has 3.8.1p1 in stable and testing (4.2 is in unstable).
> 
> Is there another way to solve the env problem?

I don't know.  The Fedora and Debian systems that I use have AcceptEnv.
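For reference, the settings in question look something like this (a
minimal sketch; the NUTCH_* pattern is illustrative, use whatever
variables your scripts actually export):

    # sshd_config on each slave (OpenSSH >= 3.9)
    AcceptEnv NUTCH_*

    # ssh_config (or ~/.ssh/config) on the master
    SendEnv NUTCH_*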

Doug

Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
Doug Cutting wrote:
>> The AcceptEnv option is only available with ssh 3.9. Debian currently
>> only has 3.8.1p1 in stable and testing (4.2 is in unstable).
>>
>> Is there another way to solve the env problem?
> 
> 
> I don't know.  The Fedora and Debian systems that I use have AcceptEnv.

I think you are using the unstable branch of Debian (version 4.2);
Fedora has version 3.9.

I have compiled ssh 4.2 by hand, and that solved the AcceptEnv problem.

But when I add a server to my slave list, I get a Java error when
uploading the files to the NDFS.

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I think end-to-end testing must focus on end-to-end problems
> (e.g. pdf parsing is already covered by unit tests, and that is
> really the right place for it).

Hate to say it, but today was the first time I got 'ant test'
to work (I hadn't tried too hard before), and yeah, I saw
several such tests.

> What about creating a trunk/qa "module" ?

Well, I am most concerned with the mapred branch, and
was specifically wondering about where to put content,
like html content.

Earl


Re: a simple map reduce tutorial

Posted by Jérôme Charron <je...@gmail.com>.
> > I think it would be better to have the junit tests
> > start jetty then
> > crawl localhost. I'd love to see some end-to-end
> > unit tests like that.

+1

> I think this would also make it nice to test things
> like recursive linking, parsing pdfs or other file
> formats, observing robots.txt or any crawling bugs
> that are encountered and then fixed.

I think end-to-end testing must focus on end-to-end problems (e.g. pdf
parsing is already covered by unit tests, and that is really the right
place for it).
It would be better to perform some end-to-end (functional) tests that
check, for example (not exhaustive):
* that, depending on the configuration, the right documents are fetched
and correctly parsed (as you suggested);
* some limit cases: protocol errors, corrupted content, ...;
* fetching/crawling/indexing performance with many configurations;
* searching performance with various query loads / database sizes, ...

Those are just some ideas, but it would be very cool if you could work
on this.

> Suggestions for where to put such test content in the
> tree?

What about creating a trunk/qa "module"?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> Suggestions for where to put such test content in the
> tree?

For now, just put them with the test code.

Doug

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I'm not sure what you mean.  I set environment
> variables in my .bashrc, 
> then simply use 'bin/start-all.sh' and 'bin/nutch
> crawl'.

Well, not sure if you looked at my tutorial, which is
now on the wiki

http://wiki.apache.org/nutch/SimpleMapReduceTutorial

but yeah, that is much simpler than what I am doing. 
Looks like a little example has been added to the FAQ,
which wasn't there last time I looked.

> NutchBean now looks for things in the subdirectory
> of the connected 
> directory named 'crawl'.  Is that an improvement or
> is it just confusing?

I think magic is ok so long as it is documented and it
works.

> I think it would be better to have the junit tests
> start jetty then 
> crawl localhost.  I'd love to see some end-to-end
> unit tests like that.

Think I will start to work on this.  Maybe start with
one page that contains just a few phrases, or maybe
just the word nutch, then make sure it can be queried
out in the end?  Could also check status through the
process to make sure everything looks good.  If
nothing else, I would likely understand the process
pretty well by the time I got done with my writing.
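
To make that concrete, here is a rough sketch of such a test (it is not
wired into Nutch's fetcher; it uses the JDK's built-in HttpServer rather
than jetty, and a plain main() rather than junit, so the example stays
self-contained; the real crawl and query steps are only indicated in
comments):

    import com.sun.net.httpserver.HttpServer;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.URL;

    public class LocalhostCrawlSketch {
        public static void main(String[] args) throws Exception {
            // Serve a single page containing a known word on a free port.
            byte[] page = "<html><body>nutch</body></html>".getBytes("UTF-8");
            HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(page);
                }
            });
            server.start();
            String url = "http://localhost:" + server.getAddress().getPort() + "/";

            // A real test would point the crawler at this URL, run the
            // fetch/index cycle, and then query the index for "nutch".
            // Here we only verify the page is actually served.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String first = in.readLine();
                if (first == null || !first.contains("nutch")) {
                    throw new AssertionError("page should contain 'nutch'");
                }
            }
            server.stop(0);
        }
    }

Swapping the direct fetch at the end for an actual crawl of that URL,
followed by a query against the resulting index, would make it a true
end-to-end test.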

I think this would also make it nice to test things
like recursive linking, parsing pdfs or other file
formats, observing robots.txt or any crawling bugs
that are encountered and then fixed.

Suggestions for where to put such test content in the
tree?

> You should be able to add them to the wiki yourself.

Thanks, I added them.

Earl


Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> 1.  Sounds like some of you have some glue programs
> that help run the whole process.  Are these going to
> end up in subversion sometime?  I am guessing there is
> much duplicated effort.

I'm not sure what you mean.  I set environment variables in my .bashrc, 
then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.

> 2.  Not sure how to test that my index actually
> worked.  Starting catalina in my index directory
> didn't work this time.

NutchBean now looks for things in the subdirectory of the connected 
directory named 'crawl'.  Is that an improvement or is it just confusing?

> 3.  What do you all think of setting up some test
> directories to crawl, in say 
> 
> http://lucene.apache.org/nutch/test/
> 
> Thinking it would be kind of cool to have junit run
> through a whole process on external pages.

I think it would be better to have the junit tests start jetty then 
crawl localhost.  I'd love to see some end-to-end unit tests like that.

> 4.  Any way that
> 
> http://spack.net/nutch/SimpleMapReduceTutorial.html
> http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
> 
> can get on the wiki?  I am using apache-ish style and
> would change to whatever, but as fun as these are to
> write, I would like to see them used.  

You should be able to add them to the wiki yourself.  Just fill out:

http://wiki.apache.org/nutch/UserPreferences

Thanks,

Doug


a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
Looks like I made it through a cycle on the map reduce
branch.  I put my steps here

http://spack.net/nutch/SimpleMapReduceTutorial.html

A few questions.

1.  Sounds like some of you have some glue programs
that help run the whole process.  Are these going to
end up in subversion sometime?  I am guessing there is
much duplicated effort.

2.  Not sure how to test that my index actually
worked.  Starting catalina in my index directory
didn't work this time.

3.  What do you all think of setting up some test
directories to crawl, in say 

http://lucene.apache.org/nutch/test/

Thinking it would be kind of cool to have junit run
through a whole process on external pages.

4.  Any way that

http://spack.net/nutch/SimpleMapReduceTutorial.html
http://spack.net/nutch/GettingNutchRunningOnUbuntu.html

can get on the wiki?  I am using apache-ish style and
would change to whatever, but as fun as these are to
write, I would like to see them used.  

Feedback would be appreciated.

Enjoy!

Earl

Re: MapReduce

Posted by Paul van Brouwershaven <pa...@vanbrouwershaven.com>.
> I have the same problem and the only thing I found was rsh. :-/
> For my servers I will do the following:
> have a simple daemon that executes all scripts in a folder and deletes
> each file after executing it.
> And I will have a script that just copies the command as a text file
> to all my slaves, where the daemon will execute it.
> This is very crude but will work with older versions of ssh as well.
I do the same here, but I don't like this setup. :)

Re: MapReduce

Posted by Stefan Groschupf <sg...@media-style.com>.
>> The AcceptEnv option is only available with ssh 3.9. Debian
>> currently only has 3.8.1p1 in stable and testing (4.2 is in unstable).
>> Is there another way to solve the env problem?
>
> I don't know.  The Fedora and Debian systems that I use have
> AcceptEnv.

I have the same problem and the only thing I found was rsh. :-/
For my servers I will do the following:
have a simple daemon that executes all scripts in a folder and deletes
each file after executing it.
And I will have a script that just copies the command as a text file to
all my slaves, where the daemon will execute it.
This is very crude but will work with older versions of ssh as well.
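
Roughly, the slave-side daemon could look like this (just a sketch of
the idea; the spool path and the poll interval are made up):

    import java.io.File;

    public class ScriptSpoolDaemon {
        public static void main(String[] args) throws Exception {
            // Hypothetical directory the master copies command files into.
            File spool = new File("/var/spool/nutch-cmds");
            while (true) {
                File[] scripts = spool.listFiles();
                if (scripts != null) {
                    for (File script : scripts) {
                        // Run each script, wait for it, then delete it.
                        new ProcessBuilder("/bin/sh", script.getPath())
                                .inheritIO().start().waitFor();
                        script.delete();
                    }
                }
                Thread.sleep(5000); // poll every five seconds
            }
        }
    }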

Stefan