
Posted to user@nutch.apache.org by Justin Hartman <jj...@gmail.com> on 2006/12/29 13:52:11 UTC

Searching via http & statistical data

Hi guys

I think my Nutch system is working reasonably well, and I am quite
happy with the way it is fetching, crawling and indexing. I do have a
problem, however: I cannot figure out how to make the http searches
pull data from the index.

Running the searcher command[1] brings up a list of search results,
but when I run the same search from the http side[2] it returns
zero results.

I've gone through the Nutch tutorials[3][4] and tried to implement
the FAQ entry[5] that addresses this very issue, but I still get no
results.

This server is running CentOS with Plesk 8.1, Tomcat 5 and Java
1.4.2. Because Plesk does very odd things I've had to change some of
the config values in my tomcat5.conf file, but the only change was
rewriting the access path to Nutch.

I'm honestly fresh out of ideas and problem-solving and now need to
resort to some help from the experts!

I'd also like to ask if there is any way to view any or all of the
following information:
   1. Documents indexed in the database
   2. Search query times

Any help on the above two questions is appreciated.

[1] bin/nutch org.apache.nutch.searcher.NutchBean apache
[2] http://localhost:9080/search.jsp?lang=en&query=apache
[3] http://wiki.apache.org/nutch/NutchTutorial
[4] http://lucene.apache.org/nutch/tutorial8.html
[5] http://wiki.apache.org/nutch/FAQ#head-0c5dd359a76f9ac5ed54f9d81d79130e4c9c3302
-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Searching via http & statistical data

Posted by Justin Hartman <jj...@gmail.com>.

Hi Nitin

> IIRC, the tutorial requires you to start the tomcat instance so it knows
> where your index is.
> Are you starting tomcat from the directory that has your index (the
> suggested way in the tutorial) ?
> Or are you indicating to the search servlet the location of your index
> in some other way?

The problem with this is that Plesk, by default, disables Tomcat
support until you upgrade to a more expensive license with SWsoft.
Once you upgrade the license key, Tomcat magically appears in the
Plesk control panel and you can then set up applications directly
through Plesk.

I've tried not to use Plesk when configuring Nutch, but there are
inherent problems. For example, catalina.sh does not exist on a Plesk
server. They have either renamed it or removed it, and the only way to
start, restart or stop Tomcat is via the Plesk control panel.

> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> shows how to set the index dir in nutch-site.xml  So either you need to
> do this or start tomcat from the index dir.

I tried to do this this afternoon after I was unable to start Tomcat
from the index directory. I figured this would work as it forces
Tomcat to pull data from the directory I specify, which in my case is
"/usr/local/nutch/crawl/". This contains my indexes, linkdb, segments,
etc. folders.

I believe the problem all comes down to stupid Plesk. For example,
most of the tutorials reference ~/tomcat/webapps/ROOT, but the way
Plesk structures things the equivalent path would actually be
/usr/share/tomcat5/psa-wars/domain.com/.

My problem with the tutorial is simply that, because of this
restructuring by Plesk, there is no WEB-INF/classes/ folder for me to
store this xml file in. I've gone through the whole tomcat5 directory
structure on the server, and if I were to put the nutch-site.xml file
anywhere I would guess the best place would be
/usr/share/tomcat5/psa-wars/domain.com/, as the nutch-0.8.1.war file
is located in this directory.

Not an ideal situation this....

Regards
Justin

-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Searching via http & statistical data

Posted by Nitin Borwankar <ni...@borwankar.com>.

Nitin Borwankar wrote:

> Justin Hartman wrote:
>
>> Hi guys
>>
>> I have my nutch system working pretty reasonably I think and I am
>> quite happy with the way it is fetching, crawling and indexing. I do
>> have a problem however in that I can not figure out how to make the
>> http searches pull data from the index.
>
>
>
> [....]
>
> Hi Justin,
>
> IIRC, the tutorial requires you to start the tomcat instance so it 
> knows where your index is.
> Are you starting tomcat from the directory that has your index (the 
> suggested way in the tutorial) ?
> Or are you indicating to the search servlet the location of your index 
> in some other way?
>
> Nitin
>
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
shows how to set the index dir in nutch-site.xml. So you either need
to do this or start Tomcat from the index dir.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/Users/tom/Applications/nutch-0.7.1/crawl-tinysite</value>
  </property>
</nutch-conf>
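
For a Nutch 0.8.x install such as the nutch-0.8.1.war mentioned in
this thread, the same override might look like the sketch below. Two
things in it are assumptions rather than anything confirmed in the
thread: that 0.8.x expects <configuration> as the root element (with
configuration.xsl as the stylesheet) in place of the 0.7-style
<nutch-conf>, and that /usr/local/nutch/crawl is the directory holding
the indexes, linkdb and segments folders. Compare it against the
nutch-default.xml shipped with your release before relying on it.

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Sketch only. The root element and stylesheet name are assumed
     for Nutch 0.8.x; check the nutch-default.xml in your release. -->

<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- Assumed: the crawl directory that contains the indexes,
         linkdb and segments folders. -->
    <value>/usr/local/nutch/crawl</value>
  </property>
</configuration>
```

The tutorial places this file in the web application's
WEB-INF/classes/ folder, and Tomcat typically needs a restart before
it picks up the change.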




-- 
Nitin Borwankar
Find, Learn, Act .... 
Greener, the search engine for the planet
http://greener.com
nitin@borwankar.com
510-872-7066

Re: Searching via http & statistical data

Posted by Nitin Borwankar <ni...@borwankar.com>.

Justin Hartman wrote:

> Hi guys
>
> I have my nutch system working pretty reasonably I think and I am
> quite happy with the way it is fetching, crawling and indexing. I do
> have a problem however in that I can not figure out how to make the
> http searches pull data from the index.

[....]

Hi Justin,

IIRC, the tutorial requires you to start the Tomcat instance in a way
that lets it know where your index is.
Are you starting Tomcat from the directory that has your index (the
suggested way in the tutorial)?
Or are you indicating to the search servlet the location of your index
in some other way?

Nitin

-- 
Nitin Borwankar
Find, Learn, Act .... 
Greener, the search engine for the planet
http://greener.com
nitin@borwankar.com
510-872-7066