Posted to solr-user@lucene.apache.org by "Mark , N" <ni...@gmail.com> on 2012/06/05 14:28:07 UTC

filtering number and repeated contents

Is it possible to filter out numbers and disclaimers (repeated content)
while indexing to Solr?
This is all surplus information, and I do not want to index it.

I have also tried the boilerpipe algorithm to remove surplus
information from web pages, such as navigational elements, templates, and
advertisements. I think it works well, but I would like to know whether I
can also filter out "disclaimer" text, mainly in email bodies.
-- 
Thanks,

*Nipen Mark *

Re: filtering number and repeated contents

Posted by Jack Krupansky <ja...@basetechnology.com>.
No. For a stored field, Solr (actually Lucene) stores the source form of the
data that was fed to Solr, so the stored value is not tokenized and still
includes all punctuation and whitespace.

-- Jack Krupansky

Re: filtering number and repeated contents

Posted by "Mark , N" <ni...@gmail.com>.
Thanks Jack, I will try an update processor.

By the way, does Solr store the tokenized "content" in a field if the field
has the property stored="true"?

-- 
Thanks,

*Nipen Mark *

Re: filtering number and repeated contents

Posted by Jack Krupansky <ja...@basetechnology.com>.
My (very limited) understanding of "boilerpipe" in Tika is that it strips 
out "short text", which is great for all the menu and navigation text, but 
the typical disclaimer at the bottom of an email is not very short and 
frequently can be longer than the email message body itself. You may have to 
resort to a custom update processor that is programmed with some disclaimer 
signature text strings to be removed from field values.

-- Jack Krupansky
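
For readers who want a starting point, the string-matching core of the custom
update processor suggested above might look like the following minimal sketch.
It is plain Java with no Solr dependency; in a real setup it would presumably
be called on the relevant field values from the processAdd method of a custom
UpdateRequestProcessor. The class name DisclaimerStripper, the marker phrase,
and the choice to cut everything from the first marker to the end of the text
are all assumptions made for illustration, not something stated in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: cuts a field value off at the first known disclaimer phrase. */
final class DisclaimerStripper {

    // One case-insensitive literal pattern per known disclaimer opening phrase.
    private final List<Pattern> markers = new ArrayList<>();

    DisclaimerStripper(List<String> markerPhrases) {
        for (String phrase : markerPhrases) {
            // Pattern.quote makes the phrase match as literal text, not as a regex.
            markers.add(Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE));
        }
    }

    /** Returns the text before the earliest marker, or the value unchanged if none matches. */
    String strip(String value) {
        int cut = value.length();
        for (Pattern marker : markers) {
            Matcher m = marker.matcher(value);
            if (m.find()) {
                cut = Math.min(cut, m.start());
            }
        }
        if (cut == value.length()) {
            return value; // no disclaimer found
        }
        // Drop the disclaimer and any whitespace that separated it from the body.
        return value.substring(0, cut).replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        // The marker phrase below is an invented example, not a real corpus sample.
        DisclaimerStripper stripper = new DisclaimerStripper(
                Arrays.asList("This e-mail and any attachments are confidential"));
        String body = "See you at the meeting.\n\n"
                + "This e-mail and any attachments are confidential and intended "
                + "solely for the addressee.";
        System.out.println(stripper.strip(body));
    }
}
```

Cutting at the marker only makes sense if the disclaimer is always the last
thing in the message. If quoted replies can follow it, removing just the
matched disclaimer block would be the safer variant.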
