Posted to user@nutch.apache.org by "Fritsch, Michael" <mi...@coremedia.com.INVALID> on 2023/09/21 13:46:42 UTC

Exclude HTML elements from Crawl

Hello,

I use Nutch 1.18 with the parse-html plugin to crawl our documentation. Each page contains elements, such as tables of contents, that should not be included in the extracted text.
I know about https://issues.apache.org/jira/browse/NUTCH-585 and have applied one of its patches.

However, I wonder whether there is already a built-in option to exclude HTML elements (such as a div with a given id or class, or other elements such as header).
I also do not understand why this small patch has not already been added to Nutch. Are there drawbacks?

Regards,
Michael

Dr. Michael Fritsch
Technical Editor

Re: Exclude HTML elements from Crawl

Posted by Sebastian Nagel <sn...@apache.org>.

Hi Michael,

 > I wonder whether there is already a built-in option to exclude HTML
 > elements (such as a div with a given id or class, or other elements such as header).

No, there isn't one so far.


 > I know https://issues.apache.org/jira/browse/NUTCH-585

 > I also do not understand why this small patch has not already been added to
 > Nutch. Are there drawbacks?

Good question. I don't know; I'll have a look.


One comment: I definitely agree that it would be very useful to have a
configurable way to strip undesired content (headers, footers, etc.) from the
HTML-to-text extract. Ideally, it should be possible to use the full
expressive power of CSS selectors for that.
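
To illustrate the idea of selector-driven exclusion, here is a minimal sketch in Python using only the standard library. It is not Nutch code and not the NUTCH-585 patch. The names `extract_text`, `ExcludingTextExtractor` and `parse_selector` are hypothetical, and the "selectors" cover only `tag`, `#id`, `.class`, `tag#id` and `tag.class`, far less than full CSS. It also assumes lowercase tag names in the selectors and properly closed elements.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split a simple selector such as 'div#toc', '.nav' or 'header'
    into (tag, id, class); parts that are absent become None."""
    tag, elem_id, elem_class = selector, None, None
    if "#" in selector:
        tag, elem_id = selector.split("#", 1)
    elif "." in selector:
        tag, elem_class = selector.split(".", 1)
    return (tag or None, elem_id, elem_class)


class ExcludingTextExtractor(HTMLParser):
    """Collects page text, skipping every element that matches a selector."""

    def __init__(self, selectors):
        super().__init__()
        self.rules = [parse_selector(s) for s in selectors]
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for rule_tag, rule_id, rule_class in self.rules:
            if rule_tag and rule_tag != tag:
                continue
            if rule_id and attrs.get("id") != rule_id:
                continue
            if rule_class and rule_class not in classes:
                continue
            return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an excluded element, count every nested element too,
        # so that the matching end tag brings the depth back to zero.
        if self.skip_depth or self._matches(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, selectors):
    """Return the text of `html` without the elements matching `selectors`."""
    parser = ExcludingTextExtractor(selectors)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><header>Site header</header>'
            '<div id="toc"><ul><li>Intro</li></ul></div>'
            '<div class="nav side">Menu<br>more</div>'
            '<p>Real <b>content</b> here.</p></body></html>')
    print(extract_text(page, ["header", "div#toc", ".nav"]))
    # prints: Real content here.
```

The depth counter is what keeps nested markup inside an excluded element (the list inside the TOC div above) out of the result. A real implementation inside a parser plugin would more likely remove the matching nodes from the parsed DOM before text extraction.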


Thanks for the suggestion and for reminding us! Nutch is a community project, and
any contribution is welcome and appreciated!


Best,
Sebastian

