Posted to dev@nutch.apache.org by "Alexander Kingson (JIRA)" <ji...@apache.org> on 2015/04/01 23:59:54 UTC

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391558#comment-14391558 ] 

Alexander Kingson commented on NUTCH-961:
-----------------------------------------

Hello,

Since I was not getting satisfactory results after upgrading to boilerpipe 1.2.0 with parse-tika (with boilerpipe support), I added some code to the nutch-2.x parser to get the same results as the boilerpipe demo website. I used some code from .v2.patch.
Attaching the patch.

Thanks.
Alex.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)