Posted to user@manifoldcf.apache.org by Salih Sen <sa...@dilisim.com> on 2015/01/13 15:06:08 UTC
Removing header and footer of Sharepoint pages
Hi Everyone,
I'm trying to index a SharePoint 2013 web site with ManifoldCF 1.7.2,
using Solr as the output connection.
How can I remove the header and footer of the aspx pages so they are not
indexed with the rest of the document?
I tried using a custom updateRequestProcessorChain, but since the aspx pages
are indexed through ExtractingRequestHandler, the HTML is stripped before the
document reaches the chain.
--
Salih Şen
Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd. Sti.
email: salih@dilisim.com
Tel: 0 222 330 20 21
GSM: 0 507 296 15 51
Re: Removing header and footer of Sharepoint pages
Posted by Karl Wright <da...@gmail.com>.
You could try using the Tika Extractor in ManifoldCF. There's support for
boilerplate removal, but I'm not sure how well it works.
Karl
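For illustration only, here is a rough sketch of the general idea of dropping header and footer regions before text extraction. It is not Tika's boilerplate removal and not part of ManifoldCF or Solr; it is a plain Python pre-filter that skips elements by id. The ids "header" and "footer", and the names SKIP_IDS, MainTextExtractor and extract_main_text, are hypothetical; real SharePoint master pages will likely use different ids or classes, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical ids of the regions to drop; adjust to the actual page markup.
SKIP_IDS = {"header", "footer"}

# Elements whose content is never page text.
SKIP_TAGS = {"script", "style"}

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collect page text while ignoring everything inside skipped regions."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped region: track nesting only.
            self.skip_depth += 1
        elif tag in SKIP_TAGS or dict(attrs).get("id") in SKIP_IDS:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of the page with the skipped regions left out."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<html><body>'
    '<div id="header"><p>Site <b>nav</b></p><br></div>'
    '<div id="content"><p>Real text</p></div>'
    '<div id="footer">Copyright</div>'
    '</body></html>'
)
main_text = extract_main_text(sample)
```

On the sample page above, main_text holds only the text of the content div. A filter like this would have to run before the document reaches the extraction step, since afterwards the markup needed to tell the regions apart is gone.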