Posted to dev@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2012/11/16 14:54:28 UTC
Anyone out there using RSS connector, who wants to help?
Hi all,
The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
contains an RSS connector that has been updated to use httpcomponents
4.2.2. I'd love for people who are in a position to do significant
RSS crawling to try it out before I pull it into trunk. Any takers?
Karl
Re: Anyone out there using RSS connector, who wants to help?
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,
Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"
I see that http://www.milliyet.com.tr/robots.txt exists.
Ahmet
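A note on the ":-1" suffix in that warning: it is consistent with a URL whose port was never set explicitly, since java.net.URI.getPort() (like java.net.URL.getPort()) returns -1 in that case. Whether this is the actual cause inside the CONNECTORS-120 RSS connector is only an assumption; nothing in this thread confirms it. The sketch below is not ManifoldCF code. The class RobotsPortSketch and its helpers effectivePort and robotsUrl are hypothetical names used only to show how a missing port could be mapped to the scheme default before building the robots.txt location.

```java
import java.net.URI;

public class RobotsPortSketch {
    // Hypothetical helper, not part of ManifoldCF: pick the port to connect to.
    public static int effectivePort(URI uri) {
        int port = uri.getPort(); // -1 when the URI carries no explicit port
        if (port != -1) {
            return port;
        }
        String scheme = uri.getScheme();
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        throw new IllegalArgumentException("No default port known for scheme: " + scheme);
    }

    // Hypothetical helper: build the robots.txt location for a document URI.
    public static String robotsUrl(URI uri) {
        return uri.getScheme() + "://" + uri.getHost() + ":" + effectivePort(uri) + "/robots.txt";
    }

    public static void main(String[] args) {
        URI feed = URI.create("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml");
        System.out.println("raw port: " + feed.getPort());             // raw port: -1
        System.out.println("effective port: " + effectivePort(feed));  // effective port: 80
        System.out.println(robotsUrl(feed)); // http://www.milliyet.com.tr:80/robots.txt
    }
}
```

If the raw -1 were passed straight through to the HTTP layer or into a log message, the result would look like "http://www.milliyet.com.tr:-1", which matches the text of the warning.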
--- On Sat, 11/17/12, Ahmet Arslan <io...@yahoo.com> wrote:
> From: Ahmet Arslan <io...@yahoo.com>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: dev@manifoldcf.apache.org
> Date: Saturday, November 17, 2012, 11:11 PM
> Hi Karl,
>
> I have never used the RSS connector, but here is what I have done.
>
> I defined a crawl job using mcf-trunk. mcf-trunk crawled
> the following two URLs:
>
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> With the CONNECTORS-120 branch I can crawl
>
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> status of "Error: Repeated service interruptions - failure
> getting document version"
>
> I see these in the log file :
>
> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>
>
> By the way in "Dechromed Content" tab (Job Setting UI) I see
> four " "
>
> Thanks,
> Ahmet
Re: Anyone out there using RSS connector, who wants to help?
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Karl,
I have never used the RSS connector, but here is what I have done.
I defined a crawl job using mcf-trunk. mcf-trunk crawled the following two URLs:
http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
With the CONNECTORS-120 branch I can crawl
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a status of "Error: Repeated service interruptions - failure getting document version"
I see these in the log file:
WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
By the way, in the "Dechromed Content" tab (Job Setting UI) I see four " "
Thanks,
Ahmet