Posted to solr-user@lucene.apache.org by li...@yahoo.com.INVALID on 2016/01/04 13:34:45 UTC

apply document filter to solr index

Hi everyone, I'm working on a Solr-based search engine that indexes documents from a large variety of websites.
The engine is focused on cooking recipes. However, one problem is that these websites provide not only content related to cooking recipes but also content related to fashion, travel, politics, liberty rights, etc., which is not what users expect to find on a search engine dedicated to cooking recipes.
Is there any way to filter out content that is not related to the core business of the search engine?
Something like parental control software, maybe?
Kind regards,
Christian

Christian Fotache
Tel: 0728.297.207
Fax: 0351.411.570

Re: apply document filter to solr index

Posted by Binoy Dalal <bi...@gmail.com>.
There is no way to do that in Solr itself.

You'll have to write something at the app level, where you're crawling
your docs, or write a custom update handler that preprocesses the crawled
docs and throws out the irrelevant ones.

One way to do that is to check the doc title and the URL for certain
keywords that suggest the article belongs to, say, the fashion domain.
If the content is well structured, you might also have certain fields
in the raw crawled doc that tell you the doc category.
To look at the raw crawled doc you can use the
DocumentAnalysisRequestHandler.
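
As a rough illustration of the title/URL keyword check described above,
here is a minimal Python sketch that would run at the crawl stage, before
anything is sent to Solr. The keyword list, the `title` and `url` field
names, and the function names are all assumptions made up for this
example; they are not part of Solr or of any particular crawler.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; tune it to the sites actually being crawled.
OFF_TOPIC_KEYWORDS = {"fashion", "travel", "politics"}


def tokens(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_off_topic(title, url):
    """Return True if the title or the URL path contains an off-topic keyword."""
    path = urlparse(url).path
    return bool((tokens(title) | tokens(path)) & OFF_TOPIC_KEYWORDS)


def keep_relevant(docs):
    """Drop crawled docs (dicts with 'title' and 'url') that look off-topic."""
    return [
        d for d in docs
        if not is_off_topic(d.get("title", ""), d.get("url", ""))
    ]
```

Only the docs returned by `keep_relevant` would then be indexed. Exact
word matching like this is crude (it will miss "fashionable", for
instance), so treat it as a starting point.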

On Mon, 4 Jan 2016, 18:07  <li...@yahoo.com.invalid> wrote:

> Hi everyone, I'm working on a search engine based on solr which indexes
> documents from a large variety of websites.
> The engine is focused on cook recipes. However, one problem is that these
> websites provide not only content related to cooking recipes but also
> content related to: fashion, travel, politics, liberty rights etc etc which
> are not what the user expects to find on a cooking recipes dedicated search
> engine.
> Is there any way to filter out content which is not related to the core
> business of the search engine?
> Something like parental control software maybe?
> Kind regards,Christian Christian Fotache Tel: 0728.297.207 Fax:
> 0351.411.570

-- 
Regards,
Binoy Dalal

Re: apply document filter to solr index

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Well, you have a crawling and extraction pipeline. You can probably inject
a classification algorithm somewhere in there, possibly an NLP model
trained on a manually labeled seed set. Or just start with a list of
typical words.

This is kind of a pre-Solr stage, though.
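
To make the "classifier trained on a manual seed" idea a bit more
concrete, here is a minimal Python sketch of a multinomial naive Bayes
classifier with add-one smoothing, written from scratch. The class name,
the labels, and the four seed sentences are invented for illustration; a
real seed set would need many more hand-labeled documents, and a proper
NLP library would likely do better.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class SeedClassifier:
    """Multinomial naive Bayes with add-one smoothing over a small seed."""

    def __init__(self, seed):
        # seed: iterable of (text, label) pairs labeled by hand
        self.word_counts = {}
        self.doc_counts = Counter()
        for text, label in seed:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in the seed
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labeled seed documents.
SEED = [
    ("bake the chicken with garlic and olive oil", "recipe"),
    ("mix flour sugar and eggs then bake the cake", "recipe"),
    ("the new fashion collection features summer dresses", "other"),
    ("parliament votes on the new election law", "other"),
]

classifier = SeedClassifier(SEED)
```

In the pipeline, only docs for which `classifier.predict(text)` returns
`"recipe"` would be passed on to Solr.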

Regards,
    Alex
On 4 Jan 2016 7:37 pm, <li...@yahoo.com.invalid> wrote:

> Hi everyone, I'm working on a search engine based on solr which indexes
> documents from a large variety of websites.
> The engine is focused on cook recipes. However, one problem is that these
> websites provide not only content related to cooking recipes but also
> content related to: fashion, travel, politics, liberty rights etc etc which
> are not what the user expects to find on a cooking recipes dedicated search
> engine.
> Is there any way to filter out content which is not related to the core
> business of the search engine?
> Something like parental control software maybe?
> Kind regards,Christian Christian Fotache Tel: 0728.297.207 Fax:
> 0351.411.570