Posted to user@nutch.apache.org by ma...@Automationdirect.com on 2015/10/01 14:20:44 UTC

Re: Remove Header Footer and Menus from crawled content

Camilo, thank you so much for sharing your changes. I am checking them out.


On 9/30/15 3:37 PM, "Camilo Tejeiro" <ca...@gmail.com> wrote:

>I believe you can do it with Tika, but I did it a different way. I
>recently had to do something similar, and I wrote a little parse-filter
>plugin to accomplish this.
>
>For reference, look at Jira issue NUTCH-585; it will give you some ideas.
>https://issues.apache.org/jira/browse/NUTCH-585
>
>If it helps, here is my open Nutch install with the plugin integrated (look
>for the parse-html-filter-select-nodes plugin). I haven't created a patch,
>but you are free to use it.
>https://github.com/osohm/apache-nutch-1.10
>
>cheers,
>
>On Wed, Sep 30, 2015 at 11:57 AM, <ma...@automationdirect.com> wrote:
>
>> Hi All,
>>
>> We need to remove the header, footer and menus from the crawled content
>> before we index it into Solr. I researched online and found references to
>> removal via Tika's boilerpipe support (NUTCH-961).
>>
>> We are currently using Nutch 1.7, but I am looking into updating to Nutch
>> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
>> better job of removing extra content.
>>
>> I would be very thankful if you could let me know the best method and
>> steps to achieve this goal, and how effective it is at removal.
>>
>> Thanks so much,
>> Madhvi
>>
>>
>
>
>-- 
>Camilo Tejeiro
>*Be **honest, be grateful, be humble.*
>https://www.linkedin.com/in/camilotejeiro
>http://camilotejeiro.wordpress.com

Re: Remove Header Footer and Menus from crawled content

Posted by Imtiaz Shakil Siddique <sh...@gmail.com>.

Hi,

If you are going to crawl the whole web, there is a Java library called
Boilerpipe (https://code.google.com/p/boilerpipe/) that might help you.

The Boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a web
page.
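
To give a feel for what "detecting clutter" can mean, here is a toy Java
sketch of one shallow text feature, link density: blocks that are very short
or consist mostly of link text (menus, footers) are flagged. This is only an
illustration of the idea. It is not Boilerpipe's code or API, and the class
name LinkDensitySketch, the method names and both thresholds are made up for
this example.

```java
final class LinkDensitySketch {

    /** Fraction of a block's words that are link text; 0.0 for an empty block. */
    static double linkDensity(int words, int wordsInLinks) {
        return words == 0 ? 0.0 : (double) wordsInLinks / words;
    }

    /**
     * Toy rule: very short or link-heavy blocks are treated as boilerplate.
     * The thresholds (10 words, 0.33 density) are arbitrary assumptions.
     */
    static boolean looksLikeBoilerplate(int words, int wordsInLinks) {
        return words < 10 || linkDensity(words, wordsInLinks) > 0.33;
    }
}
```

With these assumed thresholds, a six-word menu made entirely of links is
flagged, while a 120-word paragraph containing four linked words is kept.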

Imtiaz Shakil Siddique
On Oct 3, 2015 2:05 AM, "Camilo Tejeiro" <ca...@gmail.com> wrote:

> @marora: I am glad it helps!
> @john: I don't think you have to patch or modify the parse-html plugin.
> You can build a parse-filter that is executed afterwards, which is the way
> I am doing it currently. I read somewhere (I don't remember where) that it
> is good practice to extend the parse-html plugin rather than modify it
> directly because, as you mention, your changes might otherwise have to be
> reapplied to new Nutch releases. If you are concerned about executing an
> extra step after parsing, you could also write your own, very similar
> parse plugin (integrating the filter functionality at parse time) and
> replace the parse-html plugin in the nutch-site includes with your own.

Re: Remove Header Footer and Menus from crawled content

Posted by Camilo Tejeiro <ca...@gmail.com>.

@marora: I am glad it helps!
@john: I don't think you have to patch or modify the parse-html plugin. You
can build a parse-filter that is executed afterwards, which is the way I am
doing it currently. I read somewhere (I don't remember where) that it is
good practice to extend the parse-html plugin rather than modify it
directly because, as you mention, your changes might otherwise have to be
reapplied to new Nutch releases. If you are concerned about executing an
extra step after parsing, you could also write your own, very similar parse
plugin (integrating the filter functionality at parse time) and replace the
parse-html plugin in the nutch-site includes with your own.
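
The core of the parse-filter approach described above is walking the parsed
DOM and dropping the unwanted nodes before the text is extracted. Here is a
minimal Java sketch of that step, using only the JDK's XML parser so it runs
on its own. It is not the code of the parse-html-filter-select-nodes plugin
and it does not use Nutch's plugin API. The class name SelectNodesSketch, the
UNWANTED tag list and the assumption of well-formed XHTML input are all
illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SelectNodesSketch {

    // Hypothetical list of elements to drop; a real plugin would make this configurable.
    static final Set<String> UNWANTED = Set.of("header", "footer", "nav");

    /** Removes every descendant element whose tag name is in UNWANTED, subtree included. */
    static void stripUnwanted(Node node) {
        // Snapshot the children first: the NodeList is live, so removing
        // nodes while iterating over it directly would skip siblings.
        NodeList live = node.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            children.add(live.item(i));
        }
        for (Node child : children) {
            boolean unwanted = child.getNodeType() == Node.ELEMENT_NODE
                    && UNWANTED.contains(child.getNodeName().toLowerCase(Locale.ROOT));
            if (unwanted) {
                node.removeChild(child);
            } else {
                stripUnwanted(child);
            }
        }
    }

    /** Parses well-formed XHTML, strips unwanted elements, and returns the remaining text. */
    static String mainText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            stripUnwanted(doc.getDocumentElement());
            // Collapse runs of whitespace left behind by the removed markup.
            return doc.getDocumentElement().getTextContent().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Inside a real parse-filter the DOM would come from the HTML parser rather
than from a string, but the recursive removal would look much the same.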



On Thu, Oct 1, 2015 at 9:17 AM, John Lafitte <jl...@brandextract.com>
wrote:

> I have been using something similar to this for a while because we came
> from Google Search Appliance and had googleon and googleoff all over the
> place. I don't really like having to patch the parse-html plugin every
> time I do an upgrade; I wish I could move that into its own plugin somehow.
>
> Speaking of googleon/googleoff, is there any standard for denoting
> indexable elements? That one seems specific to GSA; it would be nice if
> there were something other search engines might also take into
> consideration.



-- 
Camilo Tejeiro
*Be **honest, be grateful, be humble.*
https://www.linkedin.com/in/camilotejeiro
http://camilotejeiro.wordpress.com

Re: Remove Header Footer and Menus from crawled content

Posted by John Lafitte <jl...@brandextract.com>.

I have been using something similar to this for a while because we came
from Google Search Appliance and had googleon and googleoff all over the
place. I don't really like having to patch the parse-html plugin every time
I do an upgrade; I wish I could move that into its own plugin somehow.

Speaking of googleon/googleoff, is there any standard for denoting
indexable elements? That one seems specific to GSA; it would be nice if
there were something other search engines might also take into consideration.
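
For readers unfamiliar with the GSA markers, here is a rough Java sketch of
the kind of filtering they imply: everything between a googleoff comment and
the matching googleon comment is dropped before indexing. This is not the
parse-html patch mentioned above. The class name GoogleOffSketch, the
regex-based approach and the decision to handle only the "index" and "all"
variants are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class GoogleOffSketch {

    // Matches <!--googleoff: X--> ... <!--googleon: X--> where X is "index" or "all".
    // DOTALL lets the excluded region span lines; ".*?" stops at the first matching
    // googleon. A googleoff with no matching googleon is left untouched.
    private static final Pattern OFF_ON = Pattern.compile(
            "<!--\\s*googleoff:\\s*(index|all)\\s*-->.*?<!--\\s*googleon:\\s*\\1\\s*-->",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the markup with every googleoff/googleon region removed. */
    static String stripExcluded(String html) {
        return OFF_ON.matcher(html).replaceAll("");
    }
}
```

A filter built on the parsed DOM instead of the raw markup would look for
comment nodes rather than use a regular expression, but the idea is the same.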

On Thu, Oct 1, 2015 at 7:20 AM, <ma...@automationdirect.com> wrote:

> Camilo, thank you so much for sharing your changes. I am checking them out.