Posted to user@nutch.apache.org by Libor Štefek <li...@logis.cz> on 2007/01/16 07:25:46 UTC

Searcher doesn't find what expected

Hi,
I'm using Nutch 0.8.1 to index several thousand text files (source code),
and I use the intranet crawling method to create the index.

Everything looks fine, but when I search for something, it often doesn't
find what it should. I'm sure that the term is in several pages, but I get
results for only some of them.

I tried adjusting limits in the properties, such as page size and number
of links, but nothing helped.
There are no error messages in the log file during the crawl.

Is there any way to find the reason for this behavior?
How can I make Nutch's results more reliable?

Thanks for any hint.
Libor


Re: Searcher doesn't find what expected

Posted by Libor Štefek <li...@logis.cz>.
Thanks for the hint.
I have changed my script, and instead of a single "nutch crawl" step
I use the generate->fetch->updatedb->fetch->invertlinks->index commands.
I don't use the dedup command.
Now it seems to be OK; search finds all occurrences.
I think Nutch removes duplicate pages even when they are at different
locations. But for me it is important to have information about every
occurrence of a term.
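
To illustrate why a dedup step can hide occurrences, here is a minimal
Python sketch of content-hash deduplication. It is not Nutch's code: the
function name, the sample URLs, and the rule of keeping the first URL in
sorted order are all assumptions made for the example. It only shows that
byte-identical files stored at different locations collapse into a single
entry.

```python
import hashlib
from collections import defaultdict


def dedup_by_content(pages):
    """Keep one URL per distinct content hash.

    `pages` maps URL -> text. Returns (kept, dropped) as sorted lists.
    Which copy survives is an assumption here: the first URL in sorted order.
    """
    by_hash = defaultdict(list)
    for url in sorted(pages):
        digest = hashlib.md5(pages[url].encode("utf-8")).hexdigest()
        by_hash[digest].append(url)
    kept = [urls[0] for urls in by_hash.values()]
    dropped = [url for urls in by_hash.values() for url in urls[1:]]
    return sorted(kept), sorted(dropped)


# Hypothetical sample data: the same source file copied into two projects.
pages = {
    "file:/src/projA/util.c": "int add(int a, int b) { return a + b; }",
    "file:/src/projB/util.c": "int add(int a, int b) { return a + b; }",
    "file:/src/projA/main.c": "int main(void) { return add(1, 2); }",
}
kept, dropped = dedup_by_content(pages)
print("kept:", kept)
print("dropped:", dropped)
```

After such a step, a search for "add" can report only one of the two
util.c copies, which matches the symptom described above when every
occurrence matters.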

Libor



-- 
Libor Štefek
LOGIS, s.r.o.
tel. 	+420 556 841 100
fax. 	+420 556 841 117
mobil 	+420 605 228 985
www.logis.cz <http://www.logis.cz/>


Re: Searcher doesn't find what expected

Posted by Alvaro Cabrerizo <to...@gmail.com>.
I recommend checking your index with Luke. With Luke you can inspect your
Lucene index (run queries, view its structure, etc.) to find out whether
the problem occurs during indexing or during search.
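
A complementary check can be sketched in Python, under stated assumptions:
scan the crawled directory for files that contain the term, then compare
that set with the paths the search actually returned. The function names
are hypothetical, the match is a plain case-sensitive substring test
(Nutch's analyzer tokenizes differently), and the list of returned paths
has to be collected by hand or from the index.

```python
import os


def files_containing(root, term):
    """Return relative paths of files under `root` whose text contains `term`."""
    matches = set()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on non-UTF-8 source files.
            with open(path, encoding="utf-8", errors="replace") as fh:
                if term in fh.read():
                    matches.add(os.path.relpath(path, root))
    return matches


def missing_from_results(expected, returned):
    """Paths that contain the term but were not among the search hits."""
    return sorted(set(expected) - set(returned))
```

If the missing files turn out to be copies of files that were returned,
duplicate removal is a likely cause; if they are unrelated, the fetch or
parse stage is a better place to look.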


Re: Searcher doesn't find what expected

Posted by kauu <ba...@gmail.com>.
So, you must show us the logs.
And did you change the nutch-site.xml in Tomcat?



-- 
www.babatu.com