Posted to user@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/10/04 09:09:42 UTC

a simple map reduce tutorial

Looks like I made it through a cycle on the map reduce
branch.  I put my steps here:

http://spack.net/nutch/SimpleMapReduceTutorial.html

A few questions.

1.  Sounds like some of you have some glue programs
that help run the whole process.  Are these going to
end up in subversion sometime?  I am guessing there is
much duplicated effort.

2.  Not sure how to test that my index actually
worked.  Starting catalina in my index directory
didn't work this time.

3.  What do you all think of setting up some test
directories to crawl, in say 

http://lucene.apache.org/nutch/test/

Thinking it would be kind of cool to have junit run
through a whole process on external pages.

4.  Any way that

http://spack.net/nutch/SimpleMapReduceTutorial.html
http://spack.net/nutch/GettingNutchRunningOnUbuntu.html

can get on the wiki?  I am using apache-ish style and
would change to whatever, but as fun as these are to
write, I would like to see them used.  

Feedback would be appreciated.

Enjoy!

Earl


		

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I think end to end testing must focus on end to end
> problems (ie checking
> pdf parsing is already checked by unit tests, and it
> is really the right place for doing it).

Hate to say it, but today was the first time I got ant
test to work (hadn't tried too hard), and yeah, I saw
several such tests.

> What about creating a trunk/qa "module" ?

Well, I am most concerned with the mapred branch, and
was specifically wondering about where to put content,
like html content.

Earl


	
		


Re: a simple map reduce tutorial

Posted by Jérôme Charron <je...@gmail.com>.
> > I think it would be better to have the junit tests
> > start jetty then
> > crawl localhost. I'd love to see some end-to-end
> > unit tests like that.

+1

> I think this would also make it nice to test things
> like recursive linking, parsing pdfs or other file
> formats, observing robots.txt or any crawling bugs
> that are encountered and then fixed.

I think end-to-end testing must focus on end-to-end problems (e.g. PDF
parsing is already checked by unit tests, and that is really the right
place for it).
It would be better to run some end-to-end (functional) tests that check
(not an exhaustive list):
* that, across many configurations, the right documents are fetched and
correctly parsed (as you suggested);
* some limit cases: protocol errors, corrupted content, ...
* fetching/crawling/indexing performance with many configurations;
* searching performance with many query loads, database sizes, ...
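
To make the "limit cases" item concrete, here is a minimal sketch of a local fixture server that answers with protocol errors on purpose. It is only an illustration under assumptions: it uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than Jetty, the class name LimitCaseFixture and the paths /ok.html and /error.html are hypothetical, and it does not call any Nutch code. A real functional test would point the fetcher at these URLs instead of the plain status() helper.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture: a localhost server with one good page and two error cases.
public class LimitCaseFixture {

    // Start a server on an ephemeral localhost port.
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            if (path.equals("/ok.html")) {
                // Normal case: a tiny HTML page.
                byte[] body = "<html><body>nutch</body></html>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } else if (path.equals("/error.html")) {
                // Limit case: server-side failure, no response body.
                exchange.sendResponseHeaders(500, -1);
            } else {
                // Limit case: any other path is reported as missing.
                exchange.sendResponseHeaders(404, -1);
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    // Stand-in for the fetcher: return only the HTTP status code for a URL.
    static int status(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            System.out.println("/ok.html      -> " + status(base + "/ok.html"));
            System.out.println("/error.html   -> " + status(base + "/error.html"));
            System.out.println("/missing.html -> " + status(base + "/missing.html"));
        } finally {
            server.stop(0);
        }
    }
}
```

Binding to port 0 lets the operating system pick a free port, so several such tests could run on the same machine without colliding.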

Those are just some ideas...
But it would be very cool if you could work on this.

> Suggestions for where to put such test content in the
> tree?

What about creating a trunk/qa "module" ?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> Suggestions for where to put such test content in the
> tree?

For now, just put them with the test code.

Doug

Re: a simple map reduce tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
> I'm not sure what you mean.  I set environment
> variables in my .bashrc, 
> then simply use 'bin/start-all.sh' and 'bin/nutch
> crawl'.

Well, not sure if you looked at my tutorial, which is
now on the wiki

http://wiki.apache.org/nutch/SimpleMapReduceTutorial

but yeah, that is much simpler than what I am doing. 
Looks like a little example has been added to the FAQ,
which wasn't there last time I looked.

> NutchBean now looks for things in the subdirectory
> of the connected 
> directory named 'crawl'.  Is that an improvement or
> is it just confusing?

I think magic is ok so long as it is documented and it
works.

> I think it would be better to have the junit tests
> start jetty then 
> crawl localhost.  I'd love to see some end-to-end
> unit tests like that.

Think I will start to work on this.  Maybe start with
one page that contains just a few phrases, or maybe
just the word nutch, then make sure it can be queried
out in the end?  Could also check status through the
process to make sure everything looks good.  If
nothing else, I would likely understand the process
pretty well by the time I got done with my writing.
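
A rough sketch of what the very first step could look like, under assumptions: it serves a single page containing just the word nutch from localhost and checks that the page can be fetched back. It uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in for Jetty, the class name LocalFixtureServer and the path /index.html are made up for illustration, and the crawl, index and query steps are not shown because they depend on the Nutch code itself.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture: one page on localhost that contains just the word "nutch".
public class LocalFixtureServer {

    static final String PAGE = "<html><body><p>nutch</p></body></html>";

    // Start a server on an ephemeral localhost port serving PAGE at /index.html.
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/index.html", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Plain HTTP GET, standing in for the real fetch step.
    static String fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/index.html";
            String html = fetch(url);
            // The end-to-end test would crawl and index this URL, then query for "nutch";
            // here we only confirm that the fixture page is reachable.
            if (!html.contains("nutch")) {
                throw new AssertionError("fixture page did not contain the expected word");
            }
            System.out.println("fetched fixture page from " + url);
        } finally {
            server.stop(0);
        }
    }
}
```

In a junit test the start() and stop(0) calls would presumably move into the set-up and tear-down methods, with the crawl of the localhost URL in between.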

I think this would also make it nice to test things
like recursive linking, parsing pdfs or other file
formats, observing robots.txt or any crawling bugs
that are encountered and then fixed.

Suggestions for where to put such test content in the
tree?

> You should be able to add them to the wiki yourself.

Thanks, I added them.

Earl


	
		


Re: a simple map reduce tutorial

Posted by Doug Cutting <cu...@nutch.org>.
Earl Cahill wrote:
> 1.  Sounds like some of you have some glue programs
> that help run the whole process.  Are these going to
> end up in subversion sometime?  I am guessing there is
> much duplicated effort.

I'm not sure what you mean.  I set environment variables in my .bashrc, 
then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.

> 2.  Not sure how to test that my index actually
> worked.  Starting catalina in my index directory
> didn't work this time.

NutchBean now looks for things in the subdirectory of the connected 
directory named 'crawl'.  Is that an improvement or is it just confusing?

> 3.  What do you all think of setting up some test
> directories to crawl, in say 
> 
> http://lucene.apache.org/nutch/test/
> 
> Thinking it would be kind of cool to have junit run
> through a whole process on external pages.

I think it would be better to have the junit tests start jetty then 
crawl localhost.  I'd love to see some end-to-end unit tests like that.

> 4.  Any way that
> 
> http://spack.net/nutch/SimpleMapReduceTutorial.html
> http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
> 
> can get on the wiki?  I am using apache-ish style and
> would change to whatever, but as fun as these are to
> write, I would like to see them used.  

You should be able to add them to the wiki yourself.  Just fill out:

http://wiki.apache.org/nutch/UserPreferences

Thanks,

Doug