Posted to user@manifoldcf.apache.org by Andrea Asta <as...@gmail.com> on 2015/05/14 14:50:35 UTC

RSS Crawler error 403

Hi,
I'm trying to set up a job for crawling some RSS feeds.
A lot of the feeds don't produce anything, and the Simple History report
shows that they return an HTTP 403 error.

An example of feed:
http://nypost.com/news/feed/

How can I manage this situation?

Thank you.
Andrea

Re: RSS Crawler error 403

Posted by Karl Wright <da...@gmail.com>.
Hi Andrea,

It sounds like you may have gotten blocked by the webmaster at nypost.
Hopefully they haven't blocked all accesses from the ManifoldCF crawler in
general, but just from your IP address.

Fetching that URL with curl works fine from here, and so does MCF when I
configure it to use your URL.

The other possibility is that you are trying to crawl through a proxy, and
that's not set up properly.
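
One way to tell those two cases apart is to request the feed from the
crawler machine, once directly and once through the proxy, and compare the
status codes. The following is only a minimal sketch using the Python
standard library, not anything ManifoldCF provides: the function name
check_feed, the "feed-check/0.1" User-Agent string, and the opener
parameter (there only so the function can be exercised without network
access) are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_feed(url, user_agent="feed-check/0.1", proxy=None, timeout=10, opener=None):
    """Return the HTTP status code the server sends back for `url`.

    `proxy` is an optional proxy URL (hypothetical example:
    "http://proxy.example.com:3128"). `opener` lets a caller inject a
    replacement for the urllib opener, e.g. in tests.
    """
    if opener is None:
        handlers = []
        if proxy:
            # Route both http and https requests through the given proxy.
            handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        opener = urllib.request.build_opener(*handlers)

    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report the code anyway.
        return err.code
```

Calling check_feed("http://nypost.com/news/feed/") from the crawler host and
again with proxy= set to your proxy URL shows which path gets the 403. If
the direct request succeeds and only the proxied one fails, the proxy
configuration is the more likely culprit; if both fail from your host but
the URL works elsewhere, an IP-level block is more likely.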

Karl
