Posted to user@nutch.apache.org by "Adler, Matthew (US)" <ma...@navaera.com> on 2016/10/05 12:08:46 UTC

Issue Crawling Alternate URLs

Hello Nutch Users:

I’m currently having an issue with Nutch 1.4, similar to the one logged here:

https://issues.apache.org/jira/browse/NUTCH-2319

Using the example in that JIRA issue, if I am on the following URL:
http://rssfeeds.azcentral.com/phoenix/asu

I expect that nutch will be able to find the alternate linked URL, specified in the following link tag:

<link rel="alternate" type="application/atom+xml" href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1" title="Phoenix - ASU">

It does not, however. I have tried changing the regular expressions in suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt, and prefix-urlfilter.txt, without success.
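
One thing that may be worth ruling out on the filter side: the regex-urlfilter.txt shipped with Nutch normally contains the rule -[?*!@=], and the first rule that matches a URL decides whether it is kept. The sketch below is Python, not Nutch code, and the RULES list is an assumed, abbreviated copy of the stock file that may not match a customized installation. Under that assumption, a URL containing "=" is rejected before the final "+." rule is ever reached.

```python
import re

# ASSUMPTION: abbreviated copy of the stock regex-urlfilter.txt rules.
# "-" means reject, "+" means accept; check your own file for the real list.
RULES = [
    ("-", r"^(file|ftp|mailto):"),  # skip non-HTTP schemes
    ("-", r"[?*!@=]"),              # skip probable query URLs
    ("+", r"."),                    # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: the URL is ignored.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://rssfeeds.azcentral.com/phoenix/asu",
        "http://rssfeeds.azcentral.com/phoenix/asu&x=1",
    ):
        print(candidate, "accepted" if accepts(candidate) else "rejected")
```

If that rule is present in your configuration, the alternate URL would be dropped even when the link is extracted correctly.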

Any feedback would be appreciated.

Please let me know,

MA
This message contains information which may be confidential and privileged. Unless you are the intended addressee (or authorized to receive for the intended addressee), you may not use, copy or disclose to anyone the message or any information contained in the message. If you have received the message in error, please advise the sender by reply and delete the message.

Re: Issue Crawling Alternate URLs

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Matthew,

afaics, the content delivered to Nutch under the URL

  http://rssfeeds.azcentral.com/phoenix/asu

does not contain the link

  http://rssfeeds.azcentral.com/phoenix/asu&x=1

That's the simple answer. What you see in a browser is often not what the server delivers to
a spider. I've tested with both Nutch and wget; see below.

Best,
Sebastian


% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
     -verbose http://rssfeeds.azcentral.com/phoenix/asu
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
...

% wget -O azcentral.asu.wget.xml http://rssfeeds.azcentral.com/phoenix/asu
--2016-10-07 09:32:21--  http://rssfeeds.azcentral.com/phoenix/asu
Resolving rssfeeds.azcentral.com (rssfeeds.azcentral.com)... 198.251.67.124, 198.251.67.127,
198.71.59.197, ...
Connecting to rssfeeds.azcentral.com (rssfeeds.azcentral.com)|198.251.67.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘azcentral.asu.wget.xml’

azcentral.asu.wget.xml                      [ <=>                       ] 136.25K  --.-KB/s    in 0.01s

2016-10-07 09:32:23 (11.6 MB/s) - ‘azcentral.asu.wget.xml’ saved [139517]

% grep -F 'http://rssfeeds.azcentral.com/phoenix/asu&x=1' azcentral.asu.wget.xml

(nothing found)
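
The same check can be scripted, including the case where the link is stored with an escaped ampersand. This is a minimal Python sketch: the helper names and the user-agent strings are made up for illustration and are not what Nutch or wget actually send.

```python
import urllib.request


def fetch(url, user_agent, timeout=30):
    """Fetch url with the given User-Agent; return (content type, body bytes)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def contains_link(body, link):
    """True if link occurs in body, either literally or with '&' written as '&amp;'."""
    text = body.decode("utf-8", errors="replace")
    return link in text or link.replace("&", "&amp;") in text


if __name__ == "__main__":
    page = "http://rssfeeds.azcentral.com/phoenix/asu"
    wanted = "http://rssfeeds.azcentral.com/phoenix/asu&x=1"
    # Hypothetical user-agent strings, chosen only to compare two kinds of client.
    for agent in ("Mozilla/5.0 (example browser)", "example-crawler/0.1"):
        content_type, body = fetch(page, agent)
        print(agent, content_type, len(body), contains_link(body, wanted))
```

Comparing the content type and the result for a browser-like and a crawler-like agent shows whether the server answers the two differently.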


On 10/06/2016 05:37 PM, Adler, Matthew (US) wrote:
> Hi Sebastian:
> 
> You are correct in terms of the first URL, which isn't my issue.  The issue is that when I crawl that initial page, http://rssfeeds.azcentral.com/phoenix/asu, I want Nutch to find the RSS page linked from it, which is this one:
> 
> http://rssfeeds.azcentral.com/phoenix/asu&x=1
> 
> The issue, though, is that Nutch can't seem to find that link.  From what I can tell, the reason is the structure of the link tag, which is:
> 
> <link rel="alternate" type="application/atom+xml" href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
> 
> Please let me know if this clarifies the issue.
> 
> Cheers,
> MA
> 


RE: Issue Crawling Alternate URLs

Posted by "Adler, Matthew (US)" <ma...@navaera.com>.
Hi Sebastian:

You are correct in terms of the first URL, which isn't my issue.  The issue is that when I crawl that initial page, http://rssfeeds.azcentral.com/phoenix/asu, I want Nutch to find the RSS page linked from it, which is this one:

http://rssfeeds.azcentral.com/phoenix/asu&x=1

The issue, though, is that Nutch can't seem to find that link.  From what I can tell, the reason is the structure of the link tag, which is:

<link rel="alternate" type="application/atom+xml" href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
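
As a sanity check on the tag itself, here is a small Python sketch that collects href values from link tags with rel="alternate". It uses the standard-library html.parser, not Nutch's own parser, and the class and function names are hypothetical. It suggests that the ampersand spelling should not matter to an HTML parser: an escaped &amp; decodes to the same URL as the raw & form.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate" ...> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "alternate" in rel_tokens and href:
            self.alternates.append(href)


def find_alternate_links(html_text):
    """Return all alternate-link hrefs found in html_text, in document order."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.alternates


ESCAPED = ('<link rel="alternate" type="application/atom+xml" '
           'href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1" title="Phoenix - ASU">')
RAW = ESCAPED.replace("&amp;", "&")

if __name__ == "__main__":
    print(find_alternate_links(ESCAPED))
    print(find_alternate_links(RAW))
```

If both spellings yield the same URL, the structure of the tag is probably not what stops the link from being found.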

Please let me know if this clarifies the issue.

Cheers,
MA

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
Sent: Thursday, October 06, 2016 8:26 AM
To: user@nutch.apache.org
Subject: Re: Issue Crawling Alternate URLs

Hi,

> http://rssfeeds.azcentral.com/phoenix/asu

That's already an RSS feed, which unfortunately fails to parse:
(using plugin "feed")
 Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 183:
XML document structures must start and end within the same entity.
(using "parse-tika")
 Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 188: XML document structures must start and end within the same entity.


When the URL is opened in a browser (Firefox), the server sends an HTML page.
At least, that's what I got when trying it:

% wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
    <description>Phoenix - ASU</description>
    <copyright>Copyright 2016, GANNETT</copyright>
    <language>en-us</language>
<item>
<feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>


Best,
Sebastian


Re: Issue Crawling Alternate URLs

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> http://rssfeeds.azcentral.com/phoenix/asu

That's already an RSS feed, which unfortunately fails to parse:
(using plugin "feed")
 Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 183:
XML document structures must start and end within the same entity.
(using "parse-tika")
 Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 188: XML document
structures must start and end within the same entity.
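
This kind of error typically means the document ended before every open element was closed, for example because the response was truncated. Below is a rough Python analogue using xml.etree.ElementTree. That is a different parser from the ROME/Tika stack above, so the message text will differ, and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Invented samples: a complete feed skeleton and one cut off mid-document.
COMPLETE = "<rss><channel><title>Phoenix - ASU</title></channel></rss>"
TRUNCATED = "<rss><channel><title>Phoenix - ASU</title>"

if __name__ == "__main__":
    for label, sample in (("complete", COMPLETE), ("truncated", TRUNCATED)):
        ok, message = check_well_formed(sample)
        print(label, "well-formed" if ok else "not well-formed: " + message)
```

Running a check like this on the saved response is a quick way to tell a parser problem from a broken document.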


When the URL is opened in a browser (Firefox), the server sends an HTML page.
At least, that's what I got when trying it:

% wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
    <description>Phoenix - ASU</description>
    <copyright>Copyright 2016, GANNETT</copyright>
    <language>en-us</language>
<item>
<feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>


Best,
Sebastian
