Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2009/07/14 16:54:36 UTC
Re: A few questions about crawl-urlfilter.txt
>Here are a few questions I had about crawl-urlfilter.txt.
Some very quick responses - others will know better.
>- Does Nutch obey crawl-urlfilter.txt properly? By default,
>it is set to not download css, but when I do the crawl, I do see
>parse.ParseUtil exceptions in my Hadoop.log
>(org.apache.nutch.parse.ParseException: parser not found for
>contentType=text/css)
>Doesn't this mean that Nutch has actually downloaded a css file and
>is trying to parse it?
Yes, but it could be that the URL didn't end in .css while the MIME
type was still set to text/css, so Nutch went looking for a CSS
parser.
But I'd suggest posting your crawl-urlfilter.txt file here.
>- Can I put a positive filter in crawl-urlfilter.txt? Like
>
>+\.(html, htm)
>
>Instead of current one which starts with "-"? Will it make Nutch
>only download files with extension htm and html?
Yes...see crawl-urlfilter.txt.template:
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
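To make the first-match rule concrete, here is a minimal Python sketch of how such a rule file could be evaluated. It is not Nutch's code: the `accepts` helper, the `POSITIVE_ONLY` rule text and the example.com URLs are made up for illustration, and Python's `re.search` is assumed to stand in for the Java regex matching that Nutch uses. Note that the alternation is written `(html|htm)` with a `$` anchor; the `(html, htm)` form from the question would only match the literal text "html, htm".

```python
import re

def accepts(rules_text, url):
    """Return True if the first matching rule starts with '+', else False."""
    for line in rules_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"  # first matching pattern decides
    return False  # no pattern matched: the URL is ignored

# Hypothetical rule file containing a single positive filter.
POSITIVE_ONLY = r"""
# accept only URLs ending in .html or .htm
+\.(html|htm)$
"""

print(accepts(POSITIVE_ONLY, "http://www.example.com/index.html"))  # True
print(accepts(POSITIVE_ONLY, "http://www.example.com/style.css"))   # False
```

Because an unmatched URL is ignored, a file with only `+` rules behaves as a whitelist; no trailing `-.*` line should be needed under the semantics described in the template.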
>- Are the extensions in crawl-urlfilter.txt case sensitive
>or not? I.e., do I have to add mp3, MP3, Mp3 to tell Nutch not to
>download mp3 files?
They are case-sensitive, but you can set up the regex to do a case-insensitive comparison.
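As a rough illustration of the case-insensitive option: Java regular expressions match case-sensitively by default, and the inline flag `(?i)` switches that off. The sketch below uses Python's `re` module, which treats `(?i)` the same way for a simple pattern like this; the URL is made up. In the filter file the corresponding rule would presumably be written `-(?i)\.(mp3)$`, though it is worth verifying that against your Nutch version.

```python
import re

url = "http://www.example.com/music/Song.MP3"  # hypothetical URL

case_sensitive = r"\.(mp3)$"        # only matches a lowercase ".mp3"
case_insensitive = r"(?i)\.(mp3)$"  # (?i) makes the whole pattern ignore case

print(bool(re.search(case_sensitive, url)))    # False
print(bool(re.search(case_insensitive, url)))  # True
```

With the flag in place, a single `mp3` entry covers mp3, MP3 and Mp3, so the doubled upper/lower-case lists are no longer needed.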
>- How does Nutch handle GET URLs which do not end
>with an extension? I.e., if there is a URL like
>http://www.mysite.com/images/1 which returns an image, will Nutch be
>able to identify it and avoid downloading it?
I think Nutch will download the file, since filtering of URLs happens
before fetching, when only the URL is known and not the content type.
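A small sketch of why this happens, assuming the filter is applied to the URL string alone: a suffix rule has nothing to match on when the URL carries no extension. `SUFFIX_RULE` is a shortened stand-in for the default "-" rule, `is_skipped` is a helper invented for this example, and the `.jpg` URL is added only for contrast.

```python
import re

# Shortened stand-in for the default suffix rule (illustration only).
SUFFIX_RULE = r"\.(gif|jpg|jpeg|png|css|js)$"

def is_skipped(url):
    """True if the suffix rule matches, i.e. the URL would be filtered out."""
    return re.search(SUFFIX_RULE, url) is not None

urls = [
    "http://www.mysite.com/images/1.jpg",  # has an extension
    "http://www.mysite.com/images/1",      # no extension
]

for url in urls:
    print(url, "skipped" if is_skipped(url) else "fetched")
```

The second URL passes the filter whatever the server later returns, so the image is only recognised as such once its Content-Type arrives with the response.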
-- Ken
--
Ken Krugler
+1 530-210-6378
RE: A few questions about crawl-urlfilter.txt
Posted by Pravin Karne <pr...@persistent.co.in>.
Hi, I have the same problem.
I have the following regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(xd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|js|JS|m3u|M3U|mid|MID|mov|MOV|mp3|MP3|mp4|MP4|mpeg|MPEG|MPG|mpg|msi|MSI|pdf|PDF|php|PHP|pl|PL|png|PNG|ppt|PPT|ps|PS|ram|RAM|rdf|RDF|rm|RM|rpm|RPM|rtf|RTF|sit|SIT|snd|SND|swf|SWF|tex|TEX|texi|TEXI|tgz|TGZ|tif|TIF|wav|WAV|wma|WMA|wmf|WMF|wml|WML|wmv|WMV|wvx|WVX|xls|XLS|xml|XML|xpm|XPM|zip|ZIP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.*
But I get the following errors with the above file:
2009-07-15 14:55:12,052 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/12802-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/12802-1
2009-07-15 14:55:12,053 WARN fetcher.Fetcher - Error parsing: http://www.avillarentals.com/image/7209/thumb: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.avillarentals.com/image/7209/thumb
2009-07-15 14:55:13,021 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4937-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4937-1
2009-07-15 14:55:13,482 WARN fetcher.Fetcher - Error parsing: http://www.luxury-holiday-villas.co.uk/feed/rss/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.luxury-holiday-villas.co.uk/feed/rss/
2009-07-15 14:55:34,371 WARN fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx
2009-07-15 14:55:34,533 WARN fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx
2009-07-15 14:56:03,503 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/19463-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/19463-1
2009-07-15 14:56:03,799 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/15407-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/15407-1
2009-07-15 14:56:05,327 WARN fetcher.Fetcher - Error parsing: http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rsd+xml url=http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef
2009-07-15 14:56:18,047 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4811-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4811-1
2009-07-15 14:56:18,298 WARN fetcher.Fetcher - Error parsing: http://www.beachhouse.com/17056-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/17056-1
So I think it is impossible to rely on a negative filter (there are too many extensions to list).
So how can I apply a positive filter in the regex-urlfilter.txt file?
Thanks in advance.
Pravin
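For what it is worth, here is a hedged sketch of what a suffix-based positive filter would and would not do for some of the URLs in the log above. `NEGATIVE` is a shortened version of the "-" suffix rule, and `POSITIVE` is a hypothetical `+\.(html|htm)$` rule that would take the place of the final `+.*` line; Python's `re.search` is assumed to behave like the Java matching for these patterns.

```python
import re

# URLs copied from the log above; all were fetched despite the suffix filter.
logged = [
    "http://www.beachhouse.com/12802-1",
    "http://www.avillarentals.com/image/7209/thumb",
    "http://www.luxury-holiday-villas.co.uk/feed/rss/",
    "http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx",
]

NEGATIVE = r"\.(css|gif|jpg|jpeg|png|xml)$"  # shortened '-' suffix rule
POSITIVE = r"\.(html|htm)$"                  # hypothetical '+' rule replacing '+.*'

for url in logged:
    blocked_by_negative = re.search(NEGATIVE, url) is not None
    accepted_by_positive = re.search(POSITIVE, url) is not None
    print(url, blocked_by_negative, accepted_by_positive)

# The trade-off: an HTML page served without a suffix is dropped as well.
print(re.search(POSITIVE, "http://www.example.com/about/") is not None)  # False
```

None of the logged URLs is caught by the negative rule, and none is accepted by the positive one. The positive rule would therefore keep these images and feeds out, at the cost of also dropping every HTML page whose URL does not end in .html or .htm.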