Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2009/07/14 16:54:36 UTC

Re: A few questions about crawl-urlfilter.txt

>Here are a few questions I had about crawl-urlfilter.txt.

Some very quick responses - others will know better.

>-          Does Nutch obey crawl-urlfilter.txt properly? By default, 
>it is set to not download css, but when I do the crawl, I do see 
>parse.ParseUtil exceptions in my Hadoop.log 
>(org.apache.nutch.parse.ParseException: parser not found for 
>contentType=text/css)
>Doesn't this mean that Nutch has actually downloaded a css file and 
>is trying to parse it?

Yes, but it could be that the URL didn't end in .css while the server 
set the MIME type to text/css, and thus Nutch is looking for a CSS 
parser.

But I'd suggest posting your crawl-urlfilter.txt file here.

>-          Can I put a positive filter in crawl-urlfilter.txt? Like
>
>+\.(html, htm)
>
>Instead of the current one which starts with "-"? Will it make Nutch 
>only download files with the extension htm or html?

Yes...see crawl-urlfilter.txt.template:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
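
As a rough illustration of the first-match rule described in the template, here is a minimal Python sketch. It is not Nutch code: the function name accepts() and the example URLs are made up for illustration, and it assumes each pattern is searched for anywhere in the URL (unanchored), which is how rules such as -[?*!@=] behave. Note that alternation in a regex is written with "|", so a positive filter for the two extensions would be written +\.(html|htm)$ rather than with a comma.

```python
import re

def accepts(rules, url):
    """Apply '+'/'-' prefixed regex rules in order; the first match decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Unanchored search: the pattern may match anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched: the URL is ignored.
    return False

# Hypothetical rule set: skip some schemes, then accept only .html/.htm URLs.
rules = [
    r"-^(file|ftp|mailto):",
    r"+\.(html|htm)$",
]

for url in [
    "http://www.mysite.com/index.html",
    "http://www.mysite.com/style.css",
    "ftp://www.mysite.com/index.html",
]:
    print(url, "accepted" if accepts(rules, url) else "ignored")
```

Because there is no trailing "+." rule in this sketch, anything that does not end in .html or .htm falls through and is ignored, including URLs such as http://www.mysite.com/ that have no extension at all.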

>-          Are the extensions in crawl-urlfilter.txt case sensitive 
>or not?  i.e. do I have to add mp3, MP3, Mp3 to tell Nutch not to 
>download mp3 files?

The patterns are case sensitive by default, but you can set up the regex to do a case-insensitive comparison.
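
One way to get a case-insensitive comparison is the embedded flag (?i), which both Java's regex engine and Python's re module support. The sketch below uses Python only to show the effect; it assumes the filter hands the text after the '+' or '-' sign straight to the regex engine, so the corresponding filter line would look like -(?i)\.(mp3)$. The URLs are made-up examples.

```python
import re

# Without the flag, only the exact lower-case suffix matches.
case_sensitive = re.compile(r"\.(mp3)$")
# With (?i) at the start, mp3, MP3 and Mp3 all match.
case_insensitive = re.compile(r"(?i)\.(mp3)$")

for url in [
    "http://www.mysite.com/song.mp3",
    "http://www.mysite.com/song.MP3",
    "http://www.mysite.com/song.Mp3",
]:
    print(
        url,
        "sensitive:", bool(case_sensitive.search(url)),
        "insensitive:", bool(case_insensitive.search(url)),
    )
```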

>-          How does Nutch handle URLs that are fetched with GET but do not 
>end with an extension? i.e. if there is a URL like 
>http://www.mysite.com/images/1 which returns an image, will Nutch be 
>able to identify it and avoid downloading it?

I think Nutch will download the file, since filtering of URLs happens 
before fetching, when only the URL text (not the content type) is known.
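
To make the point concrete, here is a small Python sketch (not Nutch code) of a suffix-based skip rule applied to the URL from the question. The rule, the helper name skipped_by_suffix() and the .jpg variant of the URL are assumptions for illustration. The filter can only look at the URL string, so a URL without an extension passes even if the server later answers with an image content type.

```python
import re

# Hypothetical suffix rule, similar in spirit to the default "skip images" line.
SUFFIX_RULE = re.compile(r"(?i)\.(gif|jpg|jpeg|png|css)$")

def skipped_by_suffix(url):
    """Return True if the URL text alone is enough to reject it."""
    return SUFFIX_RULE.search(url) is not None

for url in [
    "http://www.mysite.com/images/1.jpg",  # suffix visible in the URL
    "http://www.mysite.com/images/1",      # no suffix; type known only after fetch
]:
    print(url, "skipped" if skipped_by_suffix(url) else "passed to the fetcher")
```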

-- Ken
-- 
Ken Krugler
+1 530-210-6378

RE: A few questions about crawl-urlfilter.txt

Posted by Pravin Karne <pr...@persistent.co.in>.

Hi, I have the same problem.

I have the following regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(xd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|js|JS|m3u|M3U|mid|MID|mov|MOV|mp3|MP3|mp4|MP4|mpeg|MPEG|MPG|mpg|msi|MSI|pdf|PDF|php|PHP|pl|PL|png|PNG|ppt|PPT|ps|PS|ram|RAM|rdf|RDF|rm|RM|rpm|RPM|rtf|RTF|sit|SIT|snd|SND|swf|SWF|tex|TEX|texi|TEXI|tgz|TGZ|tif|TIF|wav|WAV|wma|WMA|wmf|WMF|wml|WML|wmv|WMV|wvx|WVX|xls|XLS|xml|XML|xpm|XPM|zip|ZIP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.*


But I get the following errors with the above file:

2009-07-15 14:55:12,052 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/12802-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/12802-1
2009-07-15 14:55:12,053 WARN  fetcher.Fetcher - Error parsing: http://www.avillarentals.com/image/7209/thumb: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.avillarentals.com/image/7209/thumb
2009-07-15 14:55:13,021 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4937-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4937-1
2009-07-15 14:55:13,482 WARN  fetcher.Fetcher - Error parsing: http://www.luxury-holiday-villas.co.uk/feed/rss/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.luxury-holiday-villas.co.uk/feed/rss/
2009-07-15 14:55:34,371 WARN  fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx
2009-07-15 14:55:34,533 WARN  fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx
2009-07-15 14:56:03,503 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/19463-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/19463-1
2009-07-15 14:56:03,799 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/15407-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/15407-1
2009-07-15 14:56:05,327 WARN  fetcher.Fetcher - Error parsing: http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rsd+xml url=http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef
2009-07-15 14:56:18,047 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4811-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4811-1
2009-07-15 14:56:18,298 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/17056-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/17056-1


So I think it is impossible to rely on a negative filter, as there are too many extensions to list.

So how can I apply a positive filter in the regex-urlfilter.txt file?

Thanks in advance.

Pravin
