Posted to user@manifoldcf.apache.org by "Wunderlich, Tobias" <to...@igd-r.fraunhofer.de> on 2011/10/06 13:18:02 UTC

MCF 0.3 - WebCrawlerConnector - Ingestion Problems

Hey guys,

I'm trying to crawl a website generated with a MediaWiki extension, and I always get this message:

"[WebcrawlerConnector.java:1312] - WEB: Decided not to ingest 'http://wiki.<host>/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0' because it did not match ingestability criteria"

Seed-url: http://wiki.<host>/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0
Inclusions (crawl and index): .*
Exclusions: none

Other sites are crawled without problems, so I'm wondering what exactly those ingestability criteria are.

Best regards,
Tobias


Re: MCF 0.3 - WebCrawlerConnector - Ingestion Problems

Posted by "Wunderlich, Tobias" <to...@igd-r.fraunhofer.de>.
Hey Karl,

thanks for your reply ...

I found the reason the crawler didn't want to ingest the site, and it was none of the criteria you mentioned ... the page simply didn't want to be indexed or followed:

<html>
	<head>
		...
		<meta name="robots" content="noindex,nofollow" />
		...
	</head>
	...
</html>
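
A minimal sketch of how a crawler might honor that tag (illustrative only — this is not ManifoldCF's actual implementation, and the class name is made up):

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: detect a robots meta directive like the one above.
public class RobotsMetaCheck {
    // Matches <meta name="robots" content="...">, case-insensitively;
    // a real parser would handle attribute order and whitespace more robustly.
    private static final Pattern ROBOTS_META = Pattern.compile(
        "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    public static boolean isIndexable(String html) {
        Matcher m = ROBOTS_META.matcher(html);
        if (m.find()) {
            String content = m.group(1).toLowerCase(Locale.ROOT);
            // "noindex" forbids indexing; "none" is shorthand for noindex,nofollow.
            return !content.contains("noindex") && !content.contains("none");
        }
        return true; // no robots meta tag: indexable by default
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=\"robots\" content=\"noindex,nofollow\" /></head></html>";
        System.out.println(isIndexable(page)); // prints "false"
    }
}
```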


Tobias


-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Thursday, October 6, 2011 16:42
To: connectors-user@incubator.apache.org
Subject: Re: MCF 0.3 - WebCrawlerConnector - Ingestion Problems

Hi Tobias,

Sorry for the delay.
There are a number of reasons a document can be rejected for indexing. They are:

(1) URL criteria, as specified in the Web job's specification information
(2) Maximum document length, as controlled by the output connection (you never told us what that was)
(3) Mime type criteria, as controlled by the output connection

So I bet this is a mime type issue.  What content-type does the page have?  What output connector are you using?

Karl


Re: MCF 0.3 - WebCrawlerConnector - Ingestion Problems

Posted by Karl Wright <da...@gmail.com>.
Hi Tobias,

Sorry for the delay.
There are a number of reasons a document can be rejected for indexing. They are:

(1) URL criteria, as specified in the Web job's specification information
(2) Maximum document length, as controlled by the output connection (you never told us what that was)
(3) Mime type criteria, as controlled by the output connection
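
For what it's worth, checks (2) and (3) amount to something like the following sketch (the length limit and allowed-type list here are hypothetical examples, not ManifoldCF's actual connector code):

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch of output-side ingestability checks;
// the limit and mime-type list are hypothetical.
public class IngestabilityCheck {
    private static final long MAX_DOCUMENT_LENGTH = 16L * 1024 * 1024; // hypothetical 16 MB cap
    private static final Set<String> ALLOWED_MIME_TYPES =
        Set.of("text/html", "text/plain", "application/pdf");

    public static boolean canIngest(long lengthBytes, String contentType) {
        if (lengthBytes > MAX_DOCUMENT_LENGTH)
            return false; // rejected by the maximum-length rule (2)
        // Strip any parameter, e.g. "text/html; charset=UTF-8" -> "text/html".
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return ALLOWED_MIME_TYPES.contains(mime); // mime-type rule (3)
    }

    public static void main(String[] args) {
        System.out.println(canIngest(1024, "text/html; charset=UTF-8")); // prints "true"
        System.out.println(canIngest(1024, "application/octet-stream")); // prints "false"
    }
}
```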

So I bet this is a mime type issue.  What content-type does the page
have?  What output connector are you using?

Karl
