Posted to user@nutch.apache.org by "Fritsch, Michael" <mi...@coremedia.com.INVALID> on 2023/09/21 13:46:42 UTC
Exclude HTML elements from Crawl
Hello,
I use Nutch 1.18 with the parse-html plugin to crawl our documentation. Each page contains elements such as tables of contents (TOCs) that should not end up in the parsed content.
I know https://issues.apache.org/jira/browse/NUTCH-585 and have applied one of the patches.
However, I wonder whether there is already a built-in option to exclude HTML elements (for example, a div with a given id or class, or other elements such as header).
I also do not understand why this small patch has not yet been added to Nutch. Are there drawbacks?
Regards,
Michael
Dr. Michael Fritsch
Technical Editor
Elevate Experience. Drive Impact.
E-Mail: michael.fritsch@coremedia.com
Phone: +49 (0) 40 325 587 0
www.coremedia.com<https://www.coremedia.com/>
--------------------------------------------------------------------------------
CoreMedia GmbH
Rödingsmarkt 9, 20459 Hamburg, Germany
Managing Director: Sören Stamer
Commercial Register: Amtsgericht Hamburg, HRB 162480
--------------------------------------------------------------------------------
Re: Exclude HTML elements from Crawl
Posted by Sebastian Nagel <sn...@apache.org>.
Hi Michael,
> I wonder if there is not already a built-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).
No, there isn't one so far.
> I know https://issues.apache.org/jira/browse/NUTCH-585
> I also do not understand why this little patch has not already been added to
> Nutch? Are there drawbacks?
Well, good question. Don't know. I'll have a look...
Maybe one comment: I definitely agree that it would be very useful to have some
configurable method to strip undesired content (headers, footers, etc.) from the
HTML-to-text extract - ideally, it should be possible to use the full
expressive power of CSS selectors for that.
Thanks for the suggestion and for reminding us! Nutch is a community project and
any contribution is welcome and appreciated!
Best,
Sebastian
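The kind of configurable exclusion discussed above (dropping a header element, or a div with a given id or class, before text extraction) can be sketched outside of Nutch as follows. This is a minimal illustration in Python using only the standard library html.parser module; it is not Nutch code, not the NUTCH-585 patch, and not a CSS selector engine. The class name FilteringTextExtractor, the helper extract_text, and the skip_tags / skip_ids / skip_classes parameters are hypothetical names chosen for this example, and the sketch assumes well-nested markup in which every non-void element has an explicit end tag.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FilteringTextExtractor(HTMLParser):
    """Collect text while skipping whole subtrees that match an exclusion rule.

    Hypothetical helper for illustration only; assumes well-nested HTML.
    """

    def __init__(self, skip_tags=(), skip_ids=(), skip_classes=()):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_ids = set(skip_ids)
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.chunks = []      # text fragments that survived filtering

    def _is_excluded(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        return (tag in self.skip_tags
                or attr.get("id") in self.skip_ids
                or bool(classes & self.skip_classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside an excluded subtree: just track nesting.
            self.skip_depth += 1
        elif self._is_excluded(tag, attrs):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, **rules):
    """Return the visible text of html with excluded subtrees removed."""
    parser = FilteringTextExtractor(**rules)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><header>Site nav</header>'
            '<div id="toc"><ul><li>Intro</li></ul></div>'
            '<div class="content main"><p>Real text<br>here.</p></div>'
            '<div class="sidebar note">Ads</div></body></html>')
    print(extract_text(page, skip_tags=["header"], skip_ids=["toc"],
                       skip_classes=["sidebar"]))
    # prints: Real text here.
```

Tracking a nesting depth rather than a single flag keeps nested elements inside an excluded block (such as the list inside the TOC div) from ending the exclusion too early. Real-world pages with unclosed tags would need a more tolerant parser, and full CSS selector support would need a dedicated selector library.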