Posted to user@nutch.apache.org by pavankumar <ma...@gmail.com> on 2008/05/20 09:04:33 UTC

Nutch Query not giving required results

Hi,
I want to filter search results so that only URLs with a
specific word in the "url" field appear in the output. If the
word to search for in the "url" field contains a hyphen (-), we do not get
any results.
I am using the following code snippet:

query.addRequiredTerm(<wordtosearch>, "url");
hits = bean.search(query, Short.MAX_VALUE);

If <wordtosearch> contains a hyphen, no results are returned.
Please help me solve this issue.

-- 
View this message in context: http://www.nabble.com/Nutch-Query-not-giving-required-results-tp17334490p17334490.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch Query not giving required results

Posted by Jasper Kamperman <ja...@openwaternet.com>.
The main point I was trying to make is that you need to use the same
tokenizer when you're indexing and when you're searching. I haven't
done this myself; perhaps other people on this list can tell you exactly
where in your configuration you can find or change the tokenizer
used to index your url field. If you use the same tokenizer
when you're querying, you should be fine.
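
To illustrate why the tokenizer has to match on both sides, here is a
minimal, self-contained Java sketch. It is a toy model and not the Nutch or
Lucene API: the class and method names are hypothetical, and it simply
assumes that the url field was split on every non-alphanumeric character at
index time, which has not been verified against a real index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these names come from Nutch or Lucene.
final class TokenizerMismatchDemo {

    // Assumed index-time tokenizer: lowercases the text and splits on any
    // character that is not a letter or digit, so "test-new" -> [test, new].
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // The raw query word is compared against the indexed tokens as-is.
    static boolean matchesRawTerm(String url, String word) {
        Set<String> indexed = new HashSet<>(tokenize(url));
        return indexed.contains(word.toLowerCase(Locale.ROOT));
    }

    // The query word goes through the same tokenizer first, and every
    // resulting token is required (order is not checked here).
    static boolean matchesTokenizedTerm(String url, String word) {
        Set<String> indexed = new HashSet<>(tokenize(url));
        List<String> queryTokens = tokenize(word);
        return !queryTokens.isEmpty() && indexed.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        String url = "http://example.com/test-new/page.html";
        System.out.println(tokenize(url));
        System.out.println(matchesRawTerm(url, "test-new"));
        System.out.println(matchesTokenizedTerm(url, "test-new"));
    }
}
```

Under these assumptions, the raw word "test-new" finds nothing because no
single indexed token equals it, while the tokenized form matches.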

On May 20, 2008, at 12:45 PM, pavankumar wrote:

>
> Hi Jasper,
> Thanks for the help. The following pseudocode
>
> tokens = tokenize(wordtosearch)
> for t in tokens
>    query.addRequiredTerm(t, "url")
>
> may not solve my problem, partly because of the ordering issue and partly
> because of cases like this: say there are two URLs, one containing the word
> "test" and the other "test-new". When I search for the word "test", both
> URLs are returned, but I need only the URL that contains the word "test"
> (we just add "test" as a required term, and it is present in both URLs).
>
> I am not using any explicit tokenizer. I believe Nutch has
> NutchDocumentTokenizer, which is called by default to tokenize. I would be
> more interested to know how to avoid splitting strings containing hyphens by
> overriding it or using my own tokenizer. Can you please point me to how to use
> my own tokenizer to solve this issue, and which configuration changes are
> needed?
>
> Thanks,
> Pavan
>
>


Re: Nutch Query not giving required results

Posted by pavankumar <ma...@gmail.com>.
Hi Jasper,
Thanks for the help. The following pseudocode

tokens = tokenize(wordtosearch)
for t in tokens
   query.addRequiredTerm(t, "url")

may not solve my problem, partly because of the ordering issue and partly
because of cases like this: say there are two URLs, one containing the word
"test" and the other "test-new". When I search for the word "test", both
URLs are returned, but I need only the URL that contains the word "test"
(we just add "test" as a required term, and it is present in both URLs).

I am not using any explicit tokenizer. I believe Nutch has
NutchDocumentTokenizer, which is called by default to tokenize. I would be
more interested to know how to avoid splitting strings containing hyphens by
overriding it or using my own tokenizer. Can you please point me to how to use
my own tokenizer to solve this issue, and which configuration changes are
needed?

Thanks,
Pavan
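
As a rough illustration of the custom-tokenizer idea in the message above,
the following self-contained Java sketch treats the hyphen as part of a word.
It is a toy model with hypothetical names, not Nutch's
NutchDocumentTokenizer or any Lucene class, and it does not show how such a
tokenizer would be wired into the Nutch configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: none of these names come from Nutch or Lucene.
final class HyphenKeepingTokenizerDemo {

    // Hypothetical tokenizer that keeps hyphens inside tokens: it splits on
    // anything that is not a letter, a digit, or '-', so "test-new" stays
    // a single token instead of becoming [test, new].
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // A URL "has" the word only if one of its tokens equals the word exactly.
    static boolean urlHasWord(String url, String word) {
        return tokenize(url).contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String plain = "http://example.com/test/index.html";
        String hyphenated = "http://example.com/test-new/index.html";
        System.out.println(urlHasWord(plain, "test"));
        System.out.println(urlHasWord(hyphenated, "test"));
        System.out.println(urlHasWord(hyphenated, "test-new"));
    }
}
```

With this tokenizer, "test" matches only the first URL and "test-new" matches
only the second. The same tokenizer would have to be applied when the url
field is indexed, so an existing index would need to be rebuilt for this
approach to work.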


Jasper Kamperman wrote:
> 
> Possibly your content was tokenized when it was indexed, splitting up  
> strings containing hyphens, so if the url was "multiple-word", the  
> indexed field looks like "multiple word". If you can find out what  
> tokenizer (if any) was used when indexing the url field, you could do  
> something like (pseudocode)
> 
> tokens = tokenize(wordtosearch)
> for t in tokens
>    query.addRequiredTerm(t, "url")
>    // should also add some restrictions that require the tokens to be
>    // in the same order
> 
> If you want to see how this stuff works, you can use Luke on your
> index -- you'll also see it has some pre-packaged Analyzers that can
> do the kind of stuff in the pseudocode above.
> 
> Hope this helps,
> 
> Jasper
> 



Re: Nutch Query not giving required results

Posted by Jasper Kamperman <ja...@openwaternet.com>.
Possibly your content was tokenized when it was indexed, splitting up  
strings containing hyphens, so if the url was "multiple-word", the  
indexed field looks like "multiple word". If you can find out what  
tokenizer (if any) was used when indexing the url field, you could do  
something like (pseudocode)

tokens = tokenize(wordtosearch)
for t in tokens
   query.addRequiredTerm(t, "url")
   // should also add some restrictions that require the tokens to be
   // in the same order

If you want to see how this stuff works, you can use Luke on your
index -- you'll also see it has some pre-packaged Analyzers that can
do the kind of stuff in the pseudocode above.

Hope this helps,

Jasper
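
The pseudocode above, including the "same order" restriction, could be
sketched in plain Java roughly as follows. This is a toy model with
hypothetical names rather than Nutch or Lucene code, and it assumes the url
field was split on non-alphanumeric characters when it was indexed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model only: none of these names come from Nutch or Lucene.
final class OrderedTokenMatchDemo {

    // Assumed tokenizer: lowercases and splits on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if the query tokens occur in the URL tokens as one contiguous run
    // in the same order, similar in spirit to a phrase match.
    static boolean matchesInOrder(String url, String word) {
        List<String> queryTokens = tokenize(word);
        if (queryTokens.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(url), queryTokens) >= 0;
    }

    public static void main(String[] args) {
        String url = "http://example.com/multiple-word/index.html";
        System.out.println(matchesInOrder(url, "multiple-word"));
        System.out.println(matchesInOrder(url, "word-multiple"));
    }
}
```

Here "multiple-word" matches the URL while "word-multiple" does not, because
the tokens must appear next to each other and in the same order. A search for
a single token such as "multiple" would still match this URL.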

On May 20, 2008, at 12:04 AM, pavankumar wrote:

>
> Hi,
> I want to filter search results so that only URLs with a
> specific word in the "url" field appear in the output. If the
> word to search for in the "url" field contains a hyphen (-), we do not
> get any results.
> I am using the following code snippet:
>
> query.addRequiredTerm(<wordtosearch>, "url");
> hits = bean.search(query, Short.MAX_VALUE);
>
> If <wordtosearch> contains a hyphen, no results are returned.
> Please help me solve this issue.
>


Re: Nutch Query not giving required results

Posted by foobar3001 <fo...@yahoo.com>.
Hello!

I am trying to get ANY filtering on the URL working, but so far no
luck. I tried specifying "... url:something" in the search string, but
it doesn't seem to work. The same happens if I add it programmatically as you
did in your code sample.

The result is always the same: If <wordtosearch> also appears in the full
text of the document, then the sample text given as the summary for the hit
is skewed to include <wordtosearch> as well, even though I really only
wanted it to be relevant in the URL, not at all in the full text of the
document.

Any idea how to fix this?




pavankumar wrote:
> 
> Hi,
> I want to filter search results so that only URLs with a
> specific word in the "url" field appear in the output. If the
> word to search for in the "url" field contains a hyphen (-), we do not get
> any results.
> I am using the following code snippet:
> 
> query.addRequiredTerm(<wordtosearch>, "url");
> hits = bean.search(query, Short.MAX_VALUE);
> 
> If <wordtosearch> contains a hyphen, no results are returned.
> Please help me solve this issue.
> 
> 
