Posted to solr-user@lucene.apache.org by "Mark , N" <ni...@gmail.com> on 2012/06/05 14:28:07 UTC
filtering number and repeated contents
Is it possible to filter out numbers and disclaimers (repeated content)
while indexing to Solr?
This is all surplus information, and I do not want to index it.
I have also tried the boilerpipe algorithm to remove surplus
information from web pages, such as navigational elements, templates, and
advertisements. I think it works well, but I would like to know whether I
can filter out "disclaimer" text too, mainly in email bodies.
--
Thanks,
*Nipen Mark *
Re: filtering number and repeated contents
Posted by Jack Krupansky <ja...@basetechnology.com>.
Solr (Lucene, actually) stores the source form of the data that was fed to
Solr, so the stored value is not tokenized and still includes all
punctuation and whitespace.
-- Jack Krupansky
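
To make the distinction above concrete, here is a minimal sketch in plain Python. It does not call Solr; the `analyze` function is a hypothetical stand-in for an analyzer chain, and the sample text is made up. It only illustrates that the stored value is the untouched input, while the terms used for searching are derived from it separately.

```python
import re

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]

# Hypothetical field value sent to the index.
source = "Hello,   World!  Call 555-1234."

stored_value = source            # what a stored="true" field would hand back
indexed_terms = analyze(source)  # what the searchable index would be built from

print(repr(stored_value))
print(indexed_terms)
```

The practical consequence is that cleanup done only in the analysis chain changes what is searchable but not what is returned; unwanted text has to be removed before the value reaches the field if it should not be stored either.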
-----Original Message-----
From: Mark , N
Sent: Thursday, June 07, 2012 7:45 AM
To: solr-user@lucene.apache.org
Subject: Re: filtering number and repeated contents
Thanks Jack, I will try updateProcessor.
By the way, does Solr store tokenized "content" in a field if the field has
the property stored="true"?
Re: filtering number and repeated contents
Posted by "Mark , N" <ni...@gmail.com>.
Thanks Jack, I will try updateProcessor.
By the way, does Solr store tokenized "content" in a field if the field has
the property stored="true"?
--
Thanks,
*Nipen Mark *
Re: filtering number and repeated contents
Posted by Jack Krupansky <ja...@basetechnology.com>.
My (very limited) understanding of "boilerpipe" in Tika is that it strips
out "short text", which is great for all the menu and navigation text, but
the typical disclaimer at the bottom of an email is not very short and
frequently can be longer than the email message body itself. You may have to
resort to a custom update processor that is programmed with some disclaimer
signature text strings to be removed from field values.
-- Jack Krupansky
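
The string-matching idea in the reply above can be sketched as follows. This is plain Python showing only the text-cleaning logic, not a Solr update processor (which would be written in Java against Solr's own classes). The marker strings, function names, and sample email are all hypothetical, and the number filter is an added assumption to cover the other half of the original question.

```python
import re

# Hypothetical disclaimer openers; a real list would come from your own mail corpus.
DISCLAIMER_MARKERS = [
    "This e-mail and any attachments are confidential",
    "CONFIDENTIALITY NOTICE",
]

def strip_disclaimer(text, markers=DISCLAIMER_MARKERS):
    """Cut the text at the earliest known disclaimer marker (case-insensitive)."""
    if not markers:
        return text
    pattern = "|".join(re.escape(marker) for marker in markers)
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return text[:match.start()].rstrip() if match else text

def strip_numbers(text):
    """Remove standalone runs of digits, then collapse leftover spaces and tabs."""
    without_digits = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"[ \t]+", " ", without_digits).strip()

email_body = (
    "Meeting moved to room 42 at 3pm.\n\n"
    "Confidentiality Notice: this message is intended only for the addressee."
)

cleaned = strip_numbers(strip_disclaimer(email_body))
print(cleaned)
```

Cutting at the first marker assumes the disclaimer always trails the message; if disclaimers can appear in the middle of a quoted thread, the markers would need an end pattern as well. Note that `strip_numbers` leaves mixed tokens such as "3pm" alone, since only whole-digit tokens are matched.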