Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/27 15:17:47 UTC

[jira] [Issue Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055530#comment-13055530 ] 

Markus Jelsma edited comment on NUTCH-961 at 6/27/11 1:16 PM:
--------------------------------------------------------------

Patch to include markup from Tika. Anchors are now detected, but fewer outlinks are found! Does anyone have a good suggestion on where to fetch our outlinks with the anchors from?

      was (Author: markus17):
    Patch to include markup from Tika. Anchors are now detected, but fewer outlinks are found!
  
> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to remove boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira