Posted to user@nutch.apache.org by Michael Taggart <mi...@webco.tv> on 2005/12/16 02:55:32 UTC

MapRed searching

I got mapred to complete a full index cycle. I now would like to search
the index I created except I can't find out how to do that. I replaced
the war file and started it from my 0.8 installation directory but every
search comes up with 0 results. Do I have to tell tomcat to search the
ndfs? Unsure how to link up the parts at this point.
Mike

Re: MapRed searching

Posted by Bruno Patini Furtado <bp...@gmail.com>.

Hi Michael,

The WEB-INF folder is defined in the Java Servlet specification.
It's the folder where a Java web application puts its class files
(WEB-INF/classes) and its JAR dependencies (WEB-INF/lib), among other
configuration files (WEB-INF/classes/log4j.xml is a common example), like
the Nutch ones.

Not exactly about Nutch, but I hope it helps your understanding of deploying
the Nutch web app :)




--
"Minds are like parachutes, they work best when open."

Bruno Patini Furtado
Software Developer
webpage: www.bpfurtado.net
blog: http://www.livejournal.com/users/bpfurtado/

Re: MapRed searching

Posted by Michael Taggart <mi...@webco.tv>.

I'm also guessing that it's important for all tasktrackers to have the
appropriate configuration set in their conf/nutch-site.xml or can I just
do it on the namenode?


Re: MapRed searching

Posted by Michael Taggart <mi...@webco.tv>.

Should I specify that urls.txt file as /user/root/urls/urls.txt so it
pulls it off the ndfs?


Re: MapRed searching

Posted by Stefan Groschupf <sg...@media-style.com>.

> I would like to crawl a list of domains,
> but I would like crawling limited to just those domains. When I first
> played around with nutch in a local setup I just set the following
> property in nutch-site.xml:
> <property>
>   <name>urlfilter.prefix.file</name>
>   <value>urls.txt</value>
>   <description>Name of file on CLASSPATH containing url prefixes
>   used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> </property>
> Can I do this in a mapred system?
Sure, all plugins also work with MapReduce.
There is also a DB URL filter plugin contribution in the JIRA.
You need to check which is better for your needs; if you only have a
few hosts, then the file-based one would be enough.

> Also how does the fetching work, does
> each new round of generate crawldb fetch the next "level" of urls?
Somewhat, yes, but Nutch uses an important-URLs-first algorithm called OPIC.
> I'm
> wondering the best way to put the crawl/index system on autopilot so
> pages are crawled and updated regularly.
A shell script, with some regular-expression matching of the Nutch
tools' output.

> Thanks again for your help Stefan.

No problem, this is how open source works; just help other newbies as well
if you know how to do it.

Stefan
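
The "script plus regular-expression matching" idea above could look roughly like the following. This is a sketch in Python rather than shell, and everything in it is an assumption for illustration: the function names, the failure pattern, and whatever command lines you pass in (substitute the Nutch tool invocations you actually run, in the order you run them).

```python
import re
import subprocess

# Assumed failure markers; a naive pattern that should be tuned to the
# output your tools really produce.
FAILURE = re.compile(r"exception|error|failed", re.IGNORECASE)


def run_step(cmd, runner=subprocess.run):
    """Run one command and decide from exit code and output whether it worked."""
    result = runner(cmd, capture_output=True, text=True)
    output = (result.stdout or "") + (result.stderr or "")
    ok = result.returncode == 0 and not FAILURE.search(output)
    return ok, output


def first_failed_step(steps, runner=subprocess.run):
    """Run the steps in order; return the first failing command, or None."""
    for cmd in steps:
        ok, _output = run_step(cmd, runner)
        if not ok:
            return cmd
    return None
```

Called from cron with a list of command lines, a non-None return value is the point where the round should stop and someone should look at the logs.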

Re: MapRed searching

Posted by Michael Taggart <mi...@webco.tv>.

It works! Wow, I feel like I'm really starting to learn how nutch works.
Okay one more newbie question. I would like to crawl a list of domains,
but I would like crawling limited to just those domains. When I first
played around with nutch in a local setup I just set the following
property in nutch-site.xml:
<property>
  <name>urlfilter.prefix.file</name>
  <value>urls.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
Can I do this in a mapred system? Also how does the fetching work, does
each new round of generate crawldb fetch the next "level" of urls? I'm
wondering the best way to put the crawl/index system on autopilot so
pages are crawled and updated regularly.
Thanks again for your help Stefan.
Mike
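
For reference, the matching that a prefix-based URL filter performs can be sketched in a few lines of Python. This illustrates the idea only and is not the plugin's code: the function names are hypothetical, and the handling of blank lines and '#' comments in the prefix file is an assumption.

```python
def load_prefixes(text):
    """Parse a prefix file: one URL prefix per line.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    prefixes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        prefixes.append(line)
    return prefixes


def accept(url, prefixes):
    """Return the URL if it starts with any configured prefix, else None."""
    return url if any(url.startswith(p) for p in prefixes) else None


# Hypothetical contents of urls.txt.
PREFIXES = load_prefixes("# sites to crawl\nhttp://www.example.com/\n\nhttp://example.org/\n")
```

With these prefixes, a URL on www.example.com is kept and a URL on any other host is dropped, which is the "limit crawling to these domains" behaviour asked about.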

Re: MapRed searching

Posted by Stefan Groschupf <sg...@media-style.com>.

Looks good, now just install Tomcat.
Uncompress your nutch-XXX.war file into a folder called ROOT.war with
unzip and change this in ROOT.war/WEB-INF/classes as well.
Then you can simply copy this folder into TOMCAT/webapps; that's it.

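The unpack-and-copy steps above can also be scripted. A minimal Python sketch, assuming you pass in your own paths; the function names are made up, and the target folder name is whatever you choose for the web application:

```python
import shutil
import zipfile
from pathlib import Path


def unpack_war(war_path, target_dir):
    """Extract a .war (a zip archive) and return its WEB-INF/classes folder."""
    target = Path(target_dir)
    with zipfile.ZipFile(war_path) as war:
        war.extractall(target)
    # Configuration files such as nutch-site.xml are edited in here.
    return target / "WEB-INF" / "classes"


def install_webapp(unpacked_dir, webapps_dir):
    """Copy the unpacked folder into Tomcat's webapps directory."""
    destination = Path(webapps_dir) / Path(unpacked_dir).name
    shutil.copytree(unpacked_dir, destination)
    return destination
```

Edit the configuration in the folder returned by unpack_war before calling install_webapp, so that Tomcat picks up the changed values.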


Re: MapRed searching

Posted by Michael Taggart <mi...@webco.tv>.

Sorry Stefan, I am so used to typing usr that I wrote my email
incorrectly. Here is exactly what is in my nutch-site.xml:

<property>
  <name>searcher.dir</name>
   <value>/user/root</value>
  <description>
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>
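
To double-check which value the web application actually sees, something like the following could be pointed at the copy of nutch-site.xml under WEB-INF/classes. It is only a sketch: the helper name is made up, and the `<configuration>` wrapper in the sample is an assumption about the root element, which the code does not depend on.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> of the <property> whose <name> matches, or None."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the root element's name does not matter.
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


# Sample input; the wrapper element is an assumption for illustration.
SAMPLE = """
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/user/root</value>
  </property>
</configuration>
"""
```

Reading the file from disk and passing its text to read_property with the name "searcher.dir" shows whether the deployed copy and the one in conf agree.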


Re: MapRed searching

Posted by Stefan Groschupf <sg...@media-style.com>.

> /user/root/urls <dir>
>
> configured with searcher.dir as /usr/root
A typo? /usr/root != /user/root!

> However, I have seen people
> talk of a web-inf folder but I can't find one on my system.
Maybe this helps:
http://wiki.media-style.com/display/nutchDocu/install+user+interface

It is part of the web application you deployed to your Tomcat:
TOMCAT/webapps/ROOT/WEB-INF/classes/nutch-site.xml should be the
location.

Stefan
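
A mismatch like the one above can be caught mechanically. A minimal Python sketch, assuming the listing has the same shape as the `bin/nutch ndfs -ls` output quoted in this thread (a path, whitespace, then `<dir>`); the function name and the default set of required subdirectories are made up for illustration.

```python
def missing_subdirs(listing, searcher_dir, required=("indexes", "segments")):
    """Return the required subdirectories that are not listed under searcher_dir."""
    # Keep only the path column of each "<path>   <dir>" line.
    paths = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/"):
            paths.add(fields[0].rstrip("/"))
    base = searcher_dir.rstrip("/")
    return [name for name in required if f"{base}/{name}" not in paths]


# The listing as posted in this thread.
LISTING = """\
Found 5 items
/user/root/crawldb      <dir>
/user/root/indexes      <dir>
/user/root/linkdb       <dir>
/user/root/segments     <dir>
/user/root/urls <dir>
"""
```

With searcher.dir set to /usr/root both subdirectories are reported missing; with /user/root the result is an empty list.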

Re: MapRed searching

Posted by Michael Taggart <mi...@webco.tv>.

Stefan,
Thank you so much for lending me a hand with this. I really appreciate
it.
Here is my output for bin/nutch ndfs -ls

Found 5 items
/user/root/crawldb      <dir>
/user/root/indexes      <dir>
/user/root/linkdb       <dir>
/user/root/segments     <dir>
/user/root/urls <dir>

I'm pretty sure that is set up correctly and I have my nutch-site.xml
configured with searcher.dir as /usr/root. However, I have seen people
talk of a web-inf folder but I can't find one on my system. Is that just
an abbreviation for something? In addition I don't have a classes
folder. My nutch-default.xml and nutch-site.xml are in my conf dir. Am I
missing something?
Mike


Re: MapRed searching

Posted by Stefan Groschupf <sg...@media-style.com>.

Mike,
the question is where your data is located in the NDFS. As a note,
searching an index stored in NDFS is very slow; however, first let us
fix your problem.
Deploy the Nutch webapp in your Tomcat and check that the nutch-default.xml
in WEB-INF/classes also has the NDFS name node configured correctly.
Then verify that you have the correct structure in your NDFS,
e.g.
/user/nutchuser/segments
/user/nutchuser/indexes
/user/nutchuser/linkdb

Then set the parameter searcher.dir in WEB-INF/classes/nutch-default.xml
to just /user/nutchuser.
That's it.
HTH
Stefan
