Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/10/15 11:18:34 UTC

ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

Hello,

During a fetch, the fetcher failed to retrieve a certain page with the
following exception:

// url is masked ****
Error parsing: http://*********/validCode.asp:
org.apache.nutch.parse.ParseException: parser not found for contentType=image/bmp url=http://0086jia.com/include/validCode.asp
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:81)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:349)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:194)

I've configured both regex-urlfilter.txt:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$

and suffix-urlfilter.txt:

### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
.tif
.tiff

Both plugins are listed in the "plugin.includes" property in nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|urlfilter-suffix|parse-(text|html|js|zip)|query-(basic|site|url)|index-basic|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

My crawl is done by running nutch inject/generate/fetch loops.

Am I missing some property I should configure in order to avoid
fetching/crawling content types I don't want? (The same goes for xml/jpeg...
and other file types.)

Thanks!

Eyal.

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

Posted by Dennis Kubes <ku...@apache.org>.

Marcin is correct about the .asp extension and the regex filter, but
Nutch is not downloading this as an image src. The page itself,
http://0086jia.com/include/validCode.asp, returns an image with a content
type of image/bmp. It looks like a simple captcha to me. Since Nutch can't
parse this type of content, it throws an error and moves on. This
shouldn't stop the fetching process; it should just log the error and
continue. AFAIK there is currently no way to filter content types,
although that might be an interesting addition.
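
To make the idea of filtering on content type concrete, here is a minimal
sketch in Python. It is not Nutch code and does not describe an existing Nutch
option; the names PARSEABLE_TYPES, media_type and should_parse are hypothetical,
and the allowlist values are only assumptions for illustration. It shows the
decision being made on the Content-Type value rather than on the URL.

```python
# Illustrative allowlist only; which types a given Nutch setup can parse
# depends on the parse plugins enabled, so treat these values as assumptions.
PARSEABLE_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def should_parse(content_type_header: str) -> bool:
    """Hypothetical check: parse only responses whose media type is allowed."""
    return media_type(content_type_header) in PARSEABLE_TYPES


print(should_parse("image/bmp"))                 # the captcha page's type
print(should_parse("text/html; charset=utf-8"))  # an ordinary HTML page
```

A check like this can only run once the response headers are available, which
is why a URL-based filter cannot do the same job for a URL ending in .asp.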

Dennis Kubes

Marcin Okraszewski wrote:
> The regex filter filters only URLs, not content types. As the URL ends with .asp, it does not match the prohibited URL patterns. The problem is that Nutch follows img/@src, so it downloads images. There is a patch for this at http://issues.apache.org/jira/browse/Nutch-488 which allows selecting which tags to use for outlinks.
> 
> See more in this thread:
> http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06961.html
> 
> Regards,
> Marcin

Re: ParseException: parser not found for contentType=image/bmp [or how to disallow certain contentTypes from fetching]

Posted by Marcin Okraszewski <ok...@o2.pl>.

The regex filter filters only URLs, not content types. As the URL ends with .asp, it does not match the prohibited URL patterns. The problem is that Nutch follows img/@src, so it downloads images. There is a patch for this at http://issues.apache.org/jira/browse/Nutch-488 which allows selecting which tags to use for outlinks.
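
The point about URL-only matching can be checked with a small Python sketch.
It reuses the exclusion pattern and the suffix list quoted in the original
post; the helper name rejected_by_url_rules and the example.com URL are
hypothetical, and the plain string-suffix test is only an approximation of
what a suffix filter does, not Nutch's implementation.

```python
import re

# Exclusion pattern copied from the regex-urlfilter.txt rule in the original
# post (the leading "-" in that file marks an exclusion; it is not regex).
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|wmv|ppt|mpg|xls|gz|rpm|tgz"
    r"|mov|MOV|jpeg|JPEG|bmp|BMP|swf)$"
)

# Suffixes copied from the suffix-urlfilter.txt excerpt in the original post.
PROHIBITED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".bmp", ".png", ".tif", ".tiff")


def rejected_by_url_rules(url: str) -> bool:
    """Hypothetical helper: True if either URL-based rule would drop the URL."""
    return bool(SKIP_SUFFIXES.search(url)) or url.endswith(PROHIBITED_SUFFIXES)


# The captcha URL ends in .asp, so neither rule rejects it.
print(rejected_by_url_rules("http://0086jia.com/include/validCode.asp"))  # False
# A URL that really ends in .bmp is rejected.
print(rejected_by_url_rules("http://example.com/logo.bmp"))  # True
```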

See more in this thread:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06961.html

Regards,
Marcin

