Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/02/01 05:10:45 UTC

Implementing a negative keyword filter in index

Hi all,

 I am planning to implement a negative keyword filter for the indexer, such
that if a negative keyword appears in a segment, that segment never shows up
in search results. I have the following steps in mind; please let me know if
this is right.

   - Writing a plug-in
      - Extend the IndexingFilter.
      - Do a NutchDocument.removeField for the negative keyword.
      - return the doc

  Now the questions are,

   - The NutchDocument is always mapped to an HTML page, so if I do the
   above, am I really keeping only that segment out of the index, or am I
   preventing the whole page from being indexed?

 Also, please let me know whether what I intend to do is right. Thanks again,
all, for your time.

Cheers,
Abhi
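
A minimal sketch of the two outcomes asked about above, using stand-in types
rather than the real Nutch classes. `Doc`, `stripField` and `dropPage` are
hypothetical names invented for illustration. In Nutch 1.x the real extension
point is `IndexingFilter.filter(...)`, which takes more arguments than shown
here, and returning null from it discards the document. Removing a field, by
contrast, only drops that field; the page itself is still indexed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NegativeKeywordSketch {

    /** Hypothetical stand-in for NutchDocument: a simple bag of named fields. */
    public static class Doc {
        private final Map<String, String> fields = new HashMap<>();

        public void add(String name, String value) { fields.put(name, value); }
        public String get(String name) { return fields.get(name); }
        public void removeField(String name) { fields.remove(name); }
    }

    /** True if the text contains any of the keywords, ignoring case. */
    static boolean containsAny(String text, List<String> keywords) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    /** Option A: strip the matching field; the page itself is still indexed. */
    public static Doc stripField(Doc doc, String field, List<String> keywords) {
        String value = doc.get(field);
        if (value != null && containsAny(value, keywords)) {
            doc.removeField(field);
        }
        return doc;
    }

    /** Option B: return null so the caller skips the whole page (assumed contract). */
    public static Doc dropPage(Doc doc, String field, List<String> keywords) {
        String value = doc.get(field);
        if (value != null && containsAny(value, keywords)) {
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        List<String> negatives = Arrays.asList("casino");

        Doc page = new Doc();
        page.add("url", "http://example.com/");
        page.add("content", "Win big at the casino");

        Doc stripped = stripField(page, "content", negatives);
        System.out.println("still indexed, url kept: " + (stripped.get("url") != null));

        Doc other = new Doc();
        other.add("content", "Win big at the casino");
        System.out.println("page dropped: " + (dropPage(other, "content", negatives) == null));
    }
}
```

Note that neither option removes a single "segment" of a page: the unit that
reaches the index is the whole document, so the choice is between indexing
the page with fewer fields and not indexing it at all.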

Re: Implementing a negative keyword filter in index

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

It would be better if you open a separate thread on the JUnit question.

About the filter issue, are you using Nutch's search or Solr? Both use Lucene
and are capable of queries with operators that prohibit a term. If it's Solr
you're using, please consult the appropriate docs, wiki and mailing lists on
how to proceed. I have no experience with Nutch's search capability, but as it
also uses Lucene I could imagine it allows these operators to be used as well.

Using these operators you can exclude documents that contain certain terms
from showing up in your search. If you filter those documents out at indexing
time instead, you cannot query for them later.

Check this for information on the LuceneQParser:
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
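
As a rough illustration of the query-time approach, the sketch below appends
each negative keyword to a user query as a prohibited phrase, using the `-`
operator described on the page above. The class and method names
(`ProhibitQuery`, `withProhibited`) are made up for this example, and the
escaping covers only backslashes and double quotes inside a phrase; treat it
as a starting point, not as how Nutch or Solr build their queries.

```java
import java.util.Arrays;
import java.util.List;

public class ProhibitQuery {

    /** Appends each negative keyword to the user query as a prohibited phrase. */
    public static String withProhibited(String userQuery, List<String> negativeKeywords) {
        StringBuilder sb = new StringBuilder(userQuery.trim());
        for (String keyword : negativeKeywords) {
            String k = keyword.trim();
            if (k.isEmpty()) {
                continue; // nothing to prohibit
            }
            // Escape backslashes and quotes so the keyword stays a single phrase.
            String escaped = k.replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append(" -\"").append(escaped).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String query = withProhibited("nutch crawler", Arrays.asList("casino", "free money"));
        System.out.println(query); // prints: nutch crawler -"casino" -"free money"
    }
}
```

The user query needs at least one positive term: according to the query
syntax documentation, a query made up only of prohibited terms returns no
results.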

Cheers,

> Hi folks,
> 
>  I am sorry for adding another question to the same mail. I am also writing
> a plug-in extending HtmlParser. How do I test it with JUnit?
> 
>  I see the "filter" method takes Content content, ParseResult
> parseResult, HTMLMetaTags metaTags, DocumentFragment doc as arguments. How
> can I generate these parameters for testing purposes?
> 
> Thanks,
> Abi

Re: Implementing a negative keyword filter in index

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi all,

 Any help or guidance on this would be much appreciated. Thanks a bunch for
all your patience.

Regards,
Abi



Re: Implementing a negative keyword filter in index

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi folks,

 I am sorry for adding another question to the same mail. I am also writing
a plug-in extending HtmlParser. How do I test it with JUnit?

 I see the "filter" method takes Content content, ParseResult
parseResult, HTMLMetaTags metaTags, DocumentFragment doc as arguments. How can
I generate these parameters for testing purposes?

Thanks,
Abi
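
One of the four arguments, the DocumentFragment, is a standard W3C DOM type
and can be built with the JDK alone. The sketch below shows one possible way
to do that for a test fixture; `FragmentFixture` and `fragmentWithParagraph`
are hypothetical names. Content, ParseResult and HTMLMetaTags are Nutch
classes and are deliberately left out, since how to construct them depends on
the Nutch version in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

public class FragmentFixture {

    /** Builds a tiny html/body/p tree and wraps it in a DocumentFragment. */
    public static DocumentFragment fragmentWithParagraph(String text) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            // Rethrow unchecked so test code does not need a throws clause.
            throw new IllegalStateException("No DOM builder available", e);
        }
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element p = dom.createElement("p");
        p.appendChild(dom.createTextNode(text));
        body.appendChild(p);
        html.appendChild(body);

        DocumentFragment fragment = dom.createDocumentFragment();
        fragment.appendChild(html);
        return fragment;
    }

    public static void main(String[] args) {
        DocumentFragment fragment = fragmentWithParagraph("hello");
        System.out.println(fragment.getFirstChild().getNodeName());
        System.out.println(fragment.getTextContent());
    }
}
```

A JUnit test could call `fragmentWithParagraph(...)` in its setup and pass the
result as the `doc` argument. A hand-built tree like this will not match what
a real HTML parser produces for messy markup, so it is best suited to testing
the filter's own logic.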

