Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2016/10/05 12:42:21 UTC
[jira] [Commented] (NUTCH-2319) Link with "rel=alternate" doesn't return in crawl
[ https://issues.apache.org/jira/browse/NUTCH-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548591#comment-15548591 ]
Markus Jelsma commented on NUTCH-2319:
--------------------------------------
Try upgrading to Nutch 1.12 and/or using parse-tika as your HTML parser.
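
As a rough sketch of the second suggestion: the set of active plugins is controlled by the plugin.includes property, which can be overridden in conf/nutch-site.xml. The value below is only an assumed example, not taken from this issue; the point is that parse-tika is listed and parse-html is not. Depending on the version, the text/html mapping in conf/parse-plugins.xml may also need to point to parse-tika.

```xml
<!-- conf/nutch-site.xml: overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value only (an assumption): parse-tika is listed and
         parse-html is not. Keep the other plugins your crawl already uses. -->
    <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Whether this makes the rel="alternate" link show up as an outlink would still need to be checked against a fresh parse of the page.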
> Link with "rel=alternate" doesn't return in crawl
> --------------------------------------------------
>
> Key: NUTCH-2319
> URL: https://issues.apache.org/jira/browse/NUTCH-2319
> Project: Nutch
> Issue Type: Bug
> Reporter: Zuber
>
> I am using Nutch 1.4. Nutch does not return the URLs found in link elements with rel="alternate".
> For example, I am crawling the URL http://rssfeeds.azcentral.com/phoenix/asu, which contains the link below, but that link does not appear in the results:
> <link rel="alternate" type="application/atom+xml" href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
> Could you please help?