Posted to dev@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2012/11/16 14:54:28 UTC

Anyone out there using RSS connector, who wants to help?

Hi all,

The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
contains an RSS connector that has been updated to use httpcomponents
4.2.2.  I'd love for people who are in a position to do significant
RSS crawling to try it out before I pull it into trunk.  Any takers?

Karl

Re: Anyone out there using RSS connector, who wants to help?

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,

Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"

I see that http://www.milliyet.com.tr/robots.txt exists.

Ahmet
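
A note on the ":-1" in that warning. In Java, java.net.URI.getPort() returns -1 when the URL carries no explicit port, so the message looks as if an unset port was passed through without being mapped to the scheme default. That is only a guess from the log text, not something confirmed against the branch. The sketch below is not ManifoldCF code; the class RobotsPortSketch and its methods effectivePort and robotsUrl are hypothetical names used only to illustrate the fallback.

```java
import java.net.URI;

// Hypothetical illustration, not taken from the CONNECTORS-120 branch.
public class RobotsPortSketch {

    // Resolve the port to connect to, falling back to the scheme default
    // when the URL does not name one (URI.getPort() returns -1 in that case).
    public static int effectivePort(URI uri) {
        int port = uri.getPort();
        if (port != -1) {
            return port;
        }
        // Assumption: only http and https matter for RSS crawling.
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    // Build the robots.txt location for the host that serves the given URL.
    public static String robotsUrl(URI uri) {
        return uri.getScheme() + "://" + uri.getHost() + ":" + effectivePort(uri) + "/robots.txt";
    }

    public static void main(String[] args) {
        URI feed = URI.create("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml");
        System.out.println(feed.getPort());   // -1, no explicit port in the URL
        System.out.println(robotsUrl(feed));  // http://www.milliyet.com.tr:80/robots.txt
    }
}
```

If the branch really does hand the raw -1 to the HTTP client, a fallback of this kind would be one way to address it, but that needs checking against the actual connector code.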

--- On Sat, 11/17/12, Ahmet Arslan <io...@yahoo.com> wrote:

> From: Ahmet Arslan <io...@yahoo.com>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: dev@manifoldcf.apache.org
> Date: Saturday, November 17, 2012, 11:11 PM
> Hi Karl,
> 
> Never used rss connector. But here is what I have done. 
> 
> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
> the following two URLs:
> 
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> 
> With CONNECTORS-120 branch I can crawl 
> 
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> 
> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> status of "Error: Repeated service interruptions - failure
> getting document version"
> 
> I see these in the log file :
> 
>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
>     at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
>     at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> 
> 
> By the way in "Dechromed Content" tab (Job Setting UI) I see
> four "&nbsp;"   
> 
> Thanks,
> Ahmet

Re: Anyone out there using RSS connector, who wants to help?

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Karl,

I have never used the RSS connector, but here is what I did.

I defined a job and ran the crawl using mcf-trunk. mcf-trunk crawled the following two URLs:

http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2

With the CONNECTORS-120 branch I can crawl

http://rss.hurriyet.com.tr/rss.aspx?sectionId=2

but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a status of "Error: Repeated service interruptions - failure getting document version"

I see these in the log file:

 WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
 WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)


By the way, in the "Dechromed Content" tab (Job Setting UI) I see four "&nbsp;".

Thanks,
Ahmet