Posted to user@nutch.apache.org by Michael Taggart <mi...@webco.tv> on 2005/12/16 02:55:32 UTC
MapRed searching
I got mapred to complete a full index cycle. I now would like to search
the index I created except I can't find out how to do that. I replaced
the war file and started it from my 0.8 installation directory but every
search comes up with 0 results. Do I have to tell tomcat to search the
ndfs? Unsure how to link up the parts at this point.
Mike
Re: MapRed searching
Posted by Bruno Patini Furtado <bp...@gmail.com>.
Hi Michael,
The WEB-INF folder is defined by the Java Servlet specification.
It's the folder where a Java web application puts its class files
(WEB-INF/classes) and its JAR dependencies (WEB-INF/lib), along with other
configuration files (WEB-INF/classes/log4j.xml is a common example), such as
the Nutch ones.
This isn't exactly about Nutch, but I hope it helps your understanding of
deploying the Nutch web app :)
On 12/16/05, Michael Taggart <mi...@webco.tv> wrote:
>
> Stefan,
> Thank you so much for lending me a hand with this. I really appreciate
> it.
> Here is my output for bin/nutch ndfs -ls
>
> Found 5 items
> /user/root/crawldb <dir>
> /user/root/indexes <dir>
> /user/root/linkdb <dir>
> /user/root/segments <dir>
> /user/root/urls <dir>
>
> I'm pretty sure that is set up correctly and I have my nutch-site.xml
> configured with searcher.dir as /usr/root. However, I have seen people
> talk of a web-inf folder but I can't find one on my system. Is that just
> an abbreviation for something? In addition I don't have a classes
> folder. My nutch-default.xml and nutch-site.xml are in my conf dir. Am I
> missing something?
> Mike
>
--
"Minds are like parachutes, they work best when open."
Bruno Patini Furtado
Software Developer
webpage: www.bpfurtado.net
blog: http://www.livejournal.com/users/bpfurtado/
Re: MapRed searching
Posted by Michael Taggart <mi...@webco.tv>.
I'm also guessing that it's important for all tasktrackers to have the
appropriate configuration set in their conf/nutch-site.xml or can I just
do it on the namenode?
On Fri, 2005-12-16 at 12:57 -0800, Michael Taggart wrote:
> Should I specify that urls.txt file as /user/root/urls/urls.txt so it
> pulls it off the ndfs?
>
Re: MapRed searching
Posted by Michael Taggart <mi...@webco.tv>.
Should I specify that urls.txt file as /user/root/urls/urls.txt so it
pulls it off the ndfs?
On Fri, 2005-12-16 at 21:39 +0100, Stefan Groschupf wrote:
> > I would like to crawl a list of domains,
> > but I would like crawling limited to just those domains. When I first
> > played around with nutch in a local setup I just set the following
> > property in nutch-site.xml:
> > <property>
> > <name>urlfilter.prefix.file</name>
> > <value>urls.txt</value>
> > <description>Name of file on CLASSPATH containing url prefixes
> > used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> > </property>
> > Can I do this in a mapred system?
> Sure, all plugins also work with map reduce.
> There is also a db url filter plugin contribution in the jira.
> You need to check which is better for your needs; if you only have a
> few hosts, then the file-based one would be enough.
>
> > Also how does the fetching work, does
> > each new round of generate crawldb fetch the next "level" of urls?
> More or less, yes, but nutch uses an important-urls-first algorithm called OPIC.
> > I'm
> > wondering the best way to put the crawl/index system on autopilot so
> > pages are crawled and updated regularly.
> A shell script, with some regular expression matching on the output of the
> nutch tools.
>
> > Thanks again for your help Stefan.
>
> Nope, this is how open source works; just help other newbies as well
> if you know how to do it.
>
> Stefan
>
Re: MapRed searching
Posted by Stefan Groschupf <sg...@media-style.com>.
> I would like to crawl a list of domains,
> but I would like crawling limited to just those domains. When I first
> played around with nutch in a local setup I just set the following
> property in nutch-site.xml:
> <property>
> <name>urlfilter.prefix.file</name>
> <value>urls.txt</value>
> <description>Name of file on CLASSPATH containing url prefixes
> used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> </property>
> Can I do this in a mapred system?
Sure, all plugins also work with map reduce.
There is also a db url filter plugin contribution in the jira.
You need to check which is better for your needs; if you only have a
few hosts, then the file-based one would be enough.
> Also how does the fetching work, does
> each new round of generate crawldb fetch the next "level" of urls?
More or less, yes, but nutch uses an important-urls-first algorithm called OPIC.
> I'm
> wondering the best way to put the crawl/index system on autopilot so
> pages are crawled and updated regularly.
A shell script, with some regular expression matching on the output of the
nutch tools.
> Thanks again for your help Stefan.
Nope, this is how open source works; just help other newbies as well
if you know how to do it.
Stefan
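
A minimal sketch of how the file-based prefix filter from the question might be switched on in nutch-site.xml. It assumes plugins are enabled through the plugin.includes property, as in the stock nutch-default.xml. The plugin list shown is illustrative only, not a recommended value: copy the real one from your own nutch-default.xml and make sure its pattern matches urlfilter-prefix.

```xml
<!-- Illustrative value only: start from the plugin.includes entry in your
     own nutch-default.xml and add urlfilter-prefix to it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-prefix|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<!-- Taken from the question above: the prefix file is looked up on the
     classpath by the urlfilter-prefix (PrefixURLFilter) plugin. -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>urls.txt</value>
</property>
```

With this in place, urls.txt would be expected to list one allowed URL prefix per line, for example http://www.example.com/ (a placeholder host, not one from this thread).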
Re: MapRed searching
Posted by Michael Taggart <mi...@webco.tv>.
It works! Wow, I feel like I'm really starting to learn how nutch works.
Okay, one more newbie question. I would like to crawl a list of domains,
but I would like crawling limited to just those domains. When I first
played around with nutch in a local setup I just set the following
property in nutch-site.xml:
<property>
  <name>urlfilter.prefix.file</name>
  <value>urls.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
Can I do this in a mapred system? Also, how does the fetching work: does
each new round of generate crawldb fetch the next "level" of urls? I'm
wondering about the best way to put the crawl/index system on autopilot so
pages are crawled and updated regularly.
Thanks again for your help Stefan.
Mike
On Fri, 2005-12-16 at 20:12 +0100, Stefan Groschupf wrote:
> Looks good, now just install Tomcat.
> Uncompress your nutch-XXX.war file into a folder called ROOT.war with
> unzip, and change this in ROOT.war/WEB-INF/classes as well.
> Then you can simply copy this folder into TOMCAT/webapps; that's it.
>
>
Re: MapRed searching
Posted by Stefan Groschupf <sg...@media-style.com>.
Looks good, now just install Tomcat.
Uncompress your nutch-XXX.war file into a folder called ROOT.war with
unzip, and change this in ROOT.war/WEB-INF/classes as well.
Then you can simply copy this folder into TOMCAT/webapps; that's it.
Am 16.12.2005 um 20:09 schrieb Michael Taggart:
> Sorry Stefan, I am so used to typing usr that I wrote my email
> incorrectly. Here is exactly what is in my nutch-site.xml:
>
> <property>
> <name>searcher.dir</name>
> <value>/user/root</value>
> <description>
> Path to root of index directories. This directory is searched (in
> order) for either the file search-servers.txt, containing a list of
> distributed search servers, or the directory "index" containing
> merged indexes, or the directory "segments" containing segment
> indexes.
> </description>
> </property>
>
Re: MapRed searching
Posted by Michael Taggart <mi...@webco.tv>.
Sorry Stefan, I am so used to typing usr that I wrote my email
incorrectly. Here is exactly what is in my nutch-site.xml:
<property>
  <name>searcher.dir</name>
  <value>/user/root</value>
  <description>
    Path to root of index directories. This directory is searched (in
    order) for either the file search-servers.txt, containing a list of
    distributed search servers, or the directory "index" containing
    merged indexes, or the directory "segments" containing segment
    indexes.
  </description>
</property>
On Fri, 2005-12-16 at 20:00 +0100, Stefan Groschupf wrote:
> > /user/root/urls <dir>
> >
> > configured with searcher.dir as /usr/root
> A typo? usr/root != user/root!
>
> > However, I have seen people
> > talk of a web-inf folder but I can't find one on my system.
> Maybe this helps:
> http://wiki.media-style.com/display/nutchDocu/install+user+interface
>
> It is part of the web application you deployed to your Tomcat:
> TOMCAT/webapps/ROOT/WEB-INF/classes/nutch-site.xml should be the
> location.
>
> Stefan
Re: MapRed searching
Posted by Stefan Groschupf <sg...@media-style.com>.
> /user/root/urls <dir>
>
> configured with searcher.dir as /usr/root
A typo? usr/root != user/root!
> However, I have seen people
> talk of a web-inf folder but I can't find one on my system.
Maybe this helps:
http://wiki.media-style.com/display/nutchDocu/install+user+interface
It is part of the web application you deployed to your Tomcat:
TOMCAT/webapps/ROOT/WEB-INF/classes/nutch-site.xml should be the
location.
Stefan
Re: MapRed searching
Posted by Michael Taggart <mi...@webco.tv>.
Stefan,
Thank you so much for lending me a hand with this. I really appreciate
it.
Here is my output for bin/nutch ndfs -ls
Found 5 items
/user/root/crawldb <dir>
/user/root/indexes <dir>
/user/root/linkdb <dir>
/user/root/segments <dir>
/user/root/urls <dir>
I'm pretty sure that is set up correctly and I have my nutch-site.xml
configured with searcher.dir as /usr/root. However, I have seen people
talk of a web-inf folder but I can't find one on my system. Is that just
an abbreviation for something? In addition I don't have a classes
folder. My nutch-default.xml and nutch-site.xml are in my conf dir. Am I
missing something?
Mike
On Fri, 2005-12-16 at 12:20 +0100, Stefan Groschupf wrote:
> Mike,
> the question is where your data is located in the ndfs. As a note,
> searching an index stored in ndfs is very slow; however, first let us
> fix your problem.
> Deploy the nutch webapp in your Tomcat, and check that the
> nutch-default.xml in WEB-INF/classes also has the ndfs name node
> configured correctly.
> Then verify that you have the correct structure in your ndfs,
> e.g.
> /user/nutchuser/segments
> /user/nutchuser/indexes
> /user/nutchuser/linkdb
>
> Then set the searcher.dir parameter in WEB-INF/classes/nutch-default.xml
> to just /user/nutchuser.
> That's it.
> HTH
> Stefan
>
Re: MapRed searching
Posted by Stefan Groschupf <sg...@media-style.com>.
Mike,
the question is where your data is located in the ndfs. As a note,
searching an index stored in ndfs is very slow; however, first let us
fix your problem.
Deploy the nutch webapp in your Tomcat, and check that the
nutch-default.xml in WEB-INF/classes also has the ndfs name node
configured correctly.
Then verify that you have the correct structure in your ndfs,
e.g.
/user/nutchuser/segments
/user/nutchuser/indexes
/user/nutchuser/linkdb
Then set the searcher.dir parameter in WEB-INF/classes/nutch-default.xml
to just /user/nutchuser.
That's it.
HTH
Stefan
Am 16.12.2005 um 02:55 schrieb Michael Taggart:
> I got mapred to complete a full index cycle. I now would like to
> search
> the index I created except I can't find out how to do that. I replaced
> the war file and started it from my 0.8 installation directory but
> every
> search comes up with 0 results. Do I have to tell tomcat to search the
> ndfs? Unsure how to link up the parts at this point.
> Mike
>
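
The searcher.dir step Stefan describes can be sketched as the entry below. This is only an illustration under the assumptions of his example: /user/nutchuser is his sample path, not a required name, and the file being edited is the copy of the Nutch configuration under the web app's WEB-INF/classes directory rather than the one in conf/.

```xml
<!-- Sketch for the Nutch config file under the web app's WEB-INF/classes.
     /user/nutchuser is the example path from Stefan's message; substitute
     the directory that actually holds your crawl output. -->
<property>
  <name>searcher.dir</name>
  <!-- Parent directory of segments, indexes and linkdb in the ndfs -->
  <value>/user/nutchuser</value>
</property>
```

The value should be the same directory that bin/nutch ndfs -ls reports as the parent of segments, indexes and linkdb.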