Posted to user@manifoldcf.apache.org by Salih Sen <sa...@dilisim.com> on 2015/01/13 15:06:08 UTC

Removing header and footer of Sharepoint pages

Hi Everyone,

I'm trying to index a SharePoint 2013 web site with ManifoldCF 1.7.2,
using Solr as the output connection.

How can I remove the header and footer of the aspx pages so they are
not indexed with the rest of the document?

I tried using a custom updateRequestProcessorChain, but since the aspx
pages are indexed through ExtractingRequestHandler, the HTML is already
stripped before the content reaches the chain.

-- 
Salih Şen

Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd. Sti.

email: salih@dilisim.com

Tel: 0 222 330 20 21

GSM: 0 507 296 15 51

Re: Removing header and footer of Sharepoint pages

Posted by Karl Wright <da...@gmail.com>.
You could try using the Tika Extractor in ManifoldCF.  There's support for
boilerplate removal, but I'm not sure how well it works.

Karl
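
For illustration only, here is a minimal Python sketch of the underlying
idea: dropping known header and footer elements from a page's HTML before
the text is extracted and sent to the index. This is not Tika's boilerplate
removal and not a ManifoldCF API; it is a plain standard-library example.
The element ids "site-header" and "site-footer" are hypothetical
placeholders, and the sketch assumes reasonably well-formed markup (every
non-void tag inside a skipped element is explicitly closed).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ChromeStripper(HTMLParser):
    """Collects page text, skipping script/style and elements in skip_ids."""

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped element: track nesting only.
            self.skip_depth += 1
        elif tag in ("script", "style") or dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def strip_chrome(html, skip_ids=("site-header", "site-footer")):
    """Return the page text without the elements named in skip_ids.

    The default ids are hypothetical; replace them with whatever ids the
    header and footer containers actually use on the crawled site.
    """
    parser = ChromeStripper(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<html><body>'
    '<div id="site-header"><a href="/">Home</a><br>Menu</div>'
    '<div id="content"><h1>Title</h1><p>Body text.</p></div>'
    '<div id="site-footer">Copyright</div>'
    '</body></html>'
)
print(strip_chrome(page))
```

A filter like this only helps if it runs before the HTML is reduced to
plain text, which is the same constraint described in the original
question.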

