Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2012/06/26 10:55:42 UTC

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

     [ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1233:
---------------------------------

    Attachment: NUTCH-1233-1.6-1.patch

Here's a new patch without garbage, and it actually compiles and runs. I have not yet removed the old getOutlink method and its helpers from DOMContentUtils, and I only commented out the old call in TikaParser for easier debugging.

So you can comment out whichever extractor you don't want to use in TikaParser, around line 147.

The new method retrieves slightly more URLs in some cases, and it also keeps in-anchor whitespace intact. This means <a>x   y    z</a> is not collapsed to "x y z", as it is with the old extractor.
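
To make the whitespace difference concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class AnchorTextDemo and the methods collapsed and preserved are hypothetical names that only mimic the two behaviours described above.

```java
public class AnchorTextDemo {

    // Hypothetical stand-in for the old behaviour: runs of whitespace
    // inside the anchor text are collapsed to a single space.
    public static String collapsed(String anchorText) {
        return anchorText.trim().replaceAll("\\s+", " ");
    }

    // Hypothetical stand-in for the new behaviour: the anchor text is
    // passed through unchanged, so in-anchor whitespace stays intact.
    public static String preserved(String anchorText) {
        return anchorText;
    }

    public static void main(String[] args) {
        String anchor = "x   y    z";
        System.out.println("[" + collapsed(anchor) + "]");
        System.out.println("[" + preserved(anchor) + "]");
    }
}
```

Anything downstream that compares or indexes anchor text would see different strings for the same link depending on which extractor is enabled, which is worth keeping in mind when comparing the output of the two.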

Please comment on this patch so I can improve it and finally resolve the issue.
                
> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch
>
>
> Tika provides outlink extraction features that are not used in Nutch. To use them in Nutch, we need Tika to return the rel attribute value of each link, which it currently doesn't. There's a patch for Tika 1.1. Once that patch is included in Tika and we have upgraded to that new version, this issue can be worked on. Here's preliminary code that does both Tika and current outlink extraction. It also includes parts of the Boilerpipe code.
