Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/18 15:13:05 UTC

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
--------------------------------

    Attachment: NUTCH-961-1.3-tikaparser.patch
                BoilerpipeExtractorRepository.java

Here's a WIP for 1.3 that adds a repository (or factory) and patches parse-tika. Use the following settings to enable it:

tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
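
Nutch properties are normally set inside the <configuration> element of conf/nutch-site.xml. As a rough sketch, assuming the patched parse-tika reads these keys from the regular Nutch configuration and that exactly one extractor name is given, the settings above might look like this:

```xml
<!-- Sketch only: assumes the patch reads these keys from the normal Nutch configuration. -->
<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <!-- Choose one of the extractors named above: ArticleExtractor or CanolaExtractor. -->
  <value>ArticleExtractor</value>
</property>
```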

Test with bin/nutch org.apache.nutch.parse.ParserChecker -dumpText <url>

There is an issue with extracting anchors of outlinks from the source text. There may also be issues with the repository that I'm currently unaware of.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to remove boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
