Posted to user@nutch.apache.org by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/07/14 14:12:51 UTC

A few questions about crawl-urlfilter.txt

Here are a few questions I had about crawl-urlfilter.txt.


-          Does Nutch obey crawl-urlfilter.txt properly? By default, it is set to not download css, but when I do the crawl, I do see parse.ParseUtil exceptions in my Hadoop.log (org.apache.nutch.parse.ParseException: parser not found for contentType=text/css)
Doesn't this mean that Nutch has actually downloaded a css file and is trying to parse it?


-          Can I put a positive filter in crawl-urlfilter.txt? Like

+\.(html, htm)

Instead of current one which starts with "-"? Will it make Nutch only download files with extension htm and html?



-          Are the extensions in crawl-urlfilter.txt case sensitive or not?  i.e. do I have to add mp3, MP3, Mp3 to tell Nutch not to download mp3 files?



-          How does Nutch handle GET URLs which do not end with an extension? i.e. if there is a URL like http://www.mysite.com/images/1 which returns an image, will Nutch be able to identify it and avoid its download?

TIA,
--Hrishi


DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: A few questions about crawl-urlfilter.txt

Posted by reinhard schwab <re...@aon.at>.
Hrishikesh Agashe schrieb:
> Here are a few questions I had about crawl-urlfilter.txt.
>
>
> -          Does Nutch obey crawl-urlfilter.txt properly? By default, it is set to not download css, but when I do the crawl, I do see parse.ParseUtil exceptions in my Hadoop.log (org.apache.nutch.parse.ParseException: parser not found for contentType=text/css)
> Doesn't this mean that Nutch has actually downloaded a css file and is trying to parse it?
>   
crawl-urlfilter.txt describes the filter rules for regexp filtering of urls.
see  conf/crawl-tool.xml
it only filters urls by matching regexps.

if the url has no css file extension and is matched positively by
one of the filter rules, the file will be downloaded and parsed.
that is what is happening in your case.
>
> -          Can I put a positive filter in crawl-urlfilter.txt? Like
>
> +\.(html, htm)
>
> Instead of current one which starts with "-"? Will it make Nutch only download files with extension htm and html?
>   
yes.
see the code in

RegexURLFilterBase

it iterates through the rules and the first matching rule is applied.
if no rule matches, the url is filtered out.

btw, the pattern has to be a regular expression, so yours should be
+\.(html|htm|HTML|HTM)$
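to illustrate the first-match-wins behaviour, a minimal whitelist-style crawl-urlfilter.txt could look like this (a sketch, not the shipped default):

```
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):

# accept only urls ending in .htm or .html
+\.(html|htm|HTML|HTM)$

# there is no catch-all +.* line here, so any url
# that matches no rule is ignored
```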

>
>
> -          Are the extensions in crawl-urlfilter.txt case sensitive or not?  i.e. do I have to add mp3, MP3, Mp3 to tell Nutch not to download mp3 files?
>   
they are case sensitive.
see RegexURLFilter: patterns are compiled with

pattern = Pattern.compile(regex);

if you want them to be case insensitive,
they have to be compiled with

pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
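a quick standalone illustration with java.util.regex (plain JDK code, not Nutch itself):

```java
import java.util.regex.Pattern;

public class CaseDemo {
    public static void main(String[] args) {
        // default compilation is case sensitive, as in RegexURLFilter
        Pattern sensitive = Pattern.compile("\\.(mp3)$");
        // one case-insensitive rule covers mp3, MP3, Mp3, ...
        Pattern insensitive = Pattern.compile("\\.(mp3)$", Pattern.CASE_INSENSITIVE);

        System.out.println(sensitive.matcher("http://example.com/song.MP3").find());   // false
        System.out.println(insensitive.matcher("http://example.com/song.MP3").find()); // true
    }
}
```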

>
>
> -          How does Nutch handle GET URLs which do not end with an extension? i.e. if there is a URL like http://www.mysite.com/images/1 which returns an image, will Nutch be able to identify it and avoid its download?
>   
download can only be avoided by defining filter rules.
if you have a rule like

-images/1$

it will not be downloaded.
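the first-match-wins lookup described above can be sketched in a few lines (an illustrative toy, not the real RegexURLFilterBase code; the rule list and urls are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FilterSketch {
    // rules in file order: pattern -> sign (true = '+', false = '-')
    static final Map<Pattern, Boolean> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("images/1$"), false);      // -images/1$
        RULES.put(Pattern.compile("\\.(html|htm)$"), true);  // +\.(html|htm)$
    }

    // returns true if the url survives filtering
    static boolean accept(String url) {
        for (Map.Entry<Pattern, Boolean> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();  // first matching rule decides
            }
        }
        return false;  // no rule matched: the url is filtered out
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.mysite.com/images/1"));    // false
        System.out.println(accept("http://www.mysite.com/index.html"));  // true
        System.out.println(accept("http://www.mysite.com/images/2"));    // false
    }
}
```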

reinhard


> TIA,
> --Hrishi


RE: A few questions about crawl-urlfilter.txt

Posted by Pravin Karne <pr...@persistent.co.in>.
Hi, I have the same problem.

I have the following regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(axd|AXD|bmp|BMP|class|CLASS|css|CSS|csv|CSV|dmg|DMG|doc|DOC|eps|EPS|exe|EXE|gif|GIF|gz|GZ|ico|ICO|ics|ICS|jpeg|JPEG|jpg|JPG|js|JS|m3u|M3U|mid|MID|mov|MOV|mp3|MP3|mp4|MP4|mpeg|MPEG|MPG|mpg|msi|MSI|pdf|PDF|php|PHP|pl|PL|png|PNG|ppt|PPT|ps|PS|ram|RAM|rdf|RDF|rm|RM|rpm|RPM|rtf|RTF|sit|SIT|snd|SND|swf|SWF|tex|TEX|texi|TEXI|tgz|TGZ|tif|TIF|wav|WAV|wma|WMA|wmf|WMF|wml|WML|wmv|WMV|wvx|WVX|xls|XLS|xml|XML|xpm|XPM|zip|ZIP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.*


But I get the following errors with the above file:

2009-07-15 14:55:12,052 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/12802-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/12802-1
2009-07-15 14:55:12,053 WARN  fetcher.Fetcher - Error parsing: http://www.avillarentals.com/image/7209/thumb: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.avillarentals.com/image/7209/thumb
2009-07-15 14:55:13,021 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4937-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4937-1
2009-07-15 14:55:13,482 WARN  fetcher.Fetcher - Error parsing: http://www.luxury-holiday-villas.co.uk/feed/rss/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.luxury-holiday-villas.co.uk/feed/rss/
2009-07-15 14:55:34,371 WARN  fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/69BB4619469C4FF2BD3462E5C6011755.ashx
2009-07-15 14:55:34,533 WARN  fetcher.Fetcher - Error parsing: http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.uidaho.edu/~/media/6DB08ABA1B12448B92A3DB2F1CEA35F4.ashx
2009-07-15 14:56:03,503 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/19463-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/19463-1
2009-07-15 14:56:03,799 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/15407-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/15407-1
2009-07-15 14:56:05,327 WARN  fetcher.Fetcher - Error parsing: http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rsd+xml url=http://www.typepad.com/services/rsd/6a00d83451bd4869e200d8341c5c0553ef
2009-07-15 14:56:18,047 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/4811-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/4811-1
2009-07-15 14:56:18,298 WARN  fetcher.Fetcher - Error parsing: http://www.beachhouse.com/17056-1: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://www.beachhouse.com/17056-1


So I think it is impossible to cover everything with a negative filter (there are far too many extensions).

So how can I apply a positive filter in the regex-urlfilter.txt file?
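One way to get a positive filter, following the first-match rule, would be to keep a couple of deny rules and then whitelist only page extensions, with no catch-all +.* at the end (a sketch; (?i) is java.util.regex's inline flag for case-insensitive matching):

```
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept only urls ending in .htm or .html, any case
+(?i)\.(html?)$

# no catch-all +.* : everything else is ignored
```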




Thanks in advance..

Pravin

-----Original Message-----
From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
Sent: Tuesday, July 14, 2009 8:25 PM
To: nutch-user@lucene.apache.org
Subject: Re: A few questions about crawl-urlfilter.txt

>Here are a few questions I had about crawl-urlfilter.txt.

Some very quick responses - others will know better.

>-          Does Nutch obey crawl-urlfilter.txt properly? By default, 
>it is set to not download css, but when I do the crawl, I do see 
>parse.ParseUtil exceptions in my Hadoop.log 
>(org.apache.nutch.parse.ParseException: parser not found for 
>contentType=text/css)
>Doesn't this mean that Nutch has actually downloaded a css file and 
>is trying to parse it?

Yes, but it could be that the file name didn't end in .css while the 
mime type was set to text/css, and thus Nutch went looking for a CSS 
parser.

But I'd suggest posting your crawl-urlfilter.txt file here.

>-          Can I put a positive filter in crawl-urlfilter.txt? Like
>
>+\.(html, htm)
>
>Instead of current one which starts with "-"? Will it make Nutch 
>only download files with extension htm and html?

Yes...see crawl-urlfilter.txt.template:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

>-          Are the extensions in crawl-urlfilter.txt case sensitive 
>or not?  i.e. do I have to add mp3, MP3, Mp3 to tell Nutch not to 
>download mp3 files?

They aren't, but you can set up the regex to do a case-insensitive comparison.

>-          How does Nutch handle GET URLs which do not end with an 
>extension? i.e. if there is a URL like 
>http://www.mysite.com/images/1 which returns an image, will Nutch be 
>able to identify it and avoid its download?

I think Nutch will download the file, since filtering of URLs happens 
before fetching.

-- Ken
-- 
Ken Krugler
+1 530-210-6378

