Posted to dev@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/11/11 19:48:14 UTC

Urlfilter Patch

Add a few more extensions which I commonly see and which cannot be parsed
(as far as I am aware): ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.

Also add additional lines (commented out by default) for quickly rejecting
URLs for extended content types (doc, png, pdf, rtf, etc.), for people who
do not want anything but HTML, or items whose URLs can lead us to the HTML.
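
For example, the commented-out lines would look something like this (the
exact extension list in the patch may differ):

# uncomment to also reject content types we can parse but may not want
# (for HTML-only crawls)
#-\.(doc|DOC|png|PNG|pdf|PDF|rtf|RTF)$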

-- 
Rod Taylor <rb...@sitesell.com>

Re: Urlfilter Patch

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Jérôme Charron wrote:
[...]
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
[...]
I would not like to exclude all others - as for example many extensions
are valid for HTML - especially dynamically generated pages (jsp, asp, cgi
just to name the easy ones, and a lot of custom ones). But the idea of
automatically allowing extensions for which plugins are enabled is good
in my opinion.
Anyway, I will try to find my own list of forbidden extensions I prepared
based on 80 million URLs - I just compiled the list of the most common ones
and went through it manually. I will try to find it over the weekend so we
can combine it with the list discussed in this thread.
P.



Re: Urlfilter Patch

Posted by Matt Kangas <ka...@gmail.com>.
Doug,

After sleeping on this idea, I realized that there's a middle ground  
that may give us (and website operators) the best of both worlds.

The question: how to avoid fetching unparseable content?

Value in answering this:
- save crawl operators bandwidth, disk space, and CPU time
- save website operators bandwidth (and maybe CPU time) = be better
web citizens

Tools available:
- regex-urlfilter.txt (nearly free to run, but is only an approximate
answer)
- HTTP HEAD before GET (cheaper than a blind GET, but mainly saves
bandwidth, not server CPU)

Proposed strategy (rough sketch of the two lists below):

1) Define regex-urlfilter.txt, as we do now. Continue to weed out
known-unparseable file extensions as early as possible.
2) Also define another regex list for extensions that are very likely
to be text/html (e.g. .html, .php). Fetch these blindly with HTTP GET.
3) For everything else, perform an HTTP HEAD first. If the mime-type is
unparseable, do not follow with an HTTP GET.

Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)

Disadvantages: ?
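
A rough sketch of what the two regex lists might look like (patterns here
are illustrative only, not a proposed default):

# list 1: known-unparseable suffixes, rejected outright
# (as regex-urlfilter.txt does today)
-(?i)\.(zip|gz|bz2|exe|jpg|jpeg|gif|png|mp3|mpg|mov|pdf|doc|xls|ppt)$

# list 2: suffixes very likely to be text/html, fetched with a blind GET
+(?i)\.(html|htm|shtml|php|jsp|asp)$

# everything else falls through to the HEAD-then-GET path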


On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:

> Totally agreed. Neither approach replaces the other. I just wanted  
> to mention the possibility so people don't over-focus on trying to  
> build a hyper-optimized regex list. :)
>
> For the content provider, an HTTP HEAD request saves them bandwidth  
> if we don't do a GET. That's some cost savings for them over doing  
> a blind fetch (esp. if we discard it).
>
> I guess the question is, what's worse:
> - two server hits when we find content we want?, or
> - spending bandwidth on pages that the Nutch installation will  
> ignore anyway?
>
> --matt
>
> On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:
>
>> Matt Kangas wrote:
>>> The latter is not strictly true. Nutch could issue an HTTP HEAD   
>>> before the HTTP GET, and determine the mime-type before actually   
>>> grabbing the content.
>>> It's not how Nutch works now, but this might be more useful than  
>>> a  super-detailed set of regexes...
>>
>> This could be a useful addition, but it could not replace url- 
>> based filters.  A HEAD request must still be polite, so this could  
>> substantially slow fetching, as it would incur more delays.  Also,  
>> for most dynamic pages, a HEAD is as expensive for the server as a  
>> GET, so this would cause more load on servers.
>>
>> Doug
>
> --
> Matt Kangas / kangas@gmail.com
>
>

--
Matt Kangas / kangas@gmail.com



Re: Urlfilter Patch

Posted by Matt Kangas <ka...@gmail.com>.
Totally agreed. Neither approach replaces the other. I just wanted to  
mention the possibility so people don't over-focus on trying to build a  
hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth  
if we don't do a GET. That's some cost savings for them over doing a  
blind fetch (esp. if we discard it).

I guess the question is, what's worse:
- two server hits when we find content we want?, or
- spending bandwidth on pages that the Nutch installation will ignore  
anyway?

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

> Matt Kangas wrote:
>> The latter is not strictly true. Nutch could issue an HTTP HEAD   
>> before the HTTP GET, and determine the mime-type before actually   
>> grabbing the content.
>> It's not how Nutch works now, but this might be more useful than  
>> a  super-detailed set of regexes...
>
> This could be a useful addition, but it could not replace url-based  
> filters.  A HEAD request must still be polite, so this could  
> substantially slow fetching, as it would incur more delays.  Also,  
> for most dynamic pages, a HEAD is as expensive for the server as a  
> GET, so this would cause more load on servers.
>
> Doug

--
Matt Kangas / kangas@gmail.com



Re: Urlfilter Patch

Posted by Doug Cutting <cu...@nutch.org>.
Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  before 
> the HTTP GET, and determine the mime-type before actually  grabbing the 
> content.
> 
> It's not how Nutch works now, but this might be more useful than a  
> super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based 
filters.  A HEAD request must still be polite, so this could 
substantially slow fetching, as it would incur more delays.  Also, for 
most dynamic pages, a HEAD is as expensive for the server as a GET, so 
this would cause more load on servers.

Doug

Re: Urlfilter Patch

Posted by Matt Kangas <ka...@gmail.com>.
The latter is not strictly true. Nutch could issue an HTTP HEAD  
before the HTTP GET, and determine the mime-type before actually  
grabbing the content.

It's not how Nutch works now, but this might be more useful than a  
super-detailed set of regexes...

kangas@kangas-dev:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host



On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:

> Chris Mattmann wrote:
>>   In principle, the mimeType system should give us some guidance on
>> determining the appropriate mimeType for the content, regardless  
>> of whether
>> it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type,  
> in order to try to keep us from fetching lots of stuff we can't  
> process. The mime type is not known until we've fetched it.
>
> Doug

--
Matt Kangas / kangas@gmail.com



RE: Urlfilter Patch

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So another alternative could be to exclude only file extensions that are
> registered in the mime-type repository (some well known file extensions)
> but for which no parser is activated, and accept all the other ones.
> So that the .foo files will be fetched...

Yup, the key phrase is "well known". It would sort of be an optimization, or
heuristic, to save some work on the regex...

Cheers,
  Chris


> 
> Jérôme


Re: Urlfilter Patch

Posted by Jérôme Charron <je...@gmail.com>.
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the document's URL.
So another alternative could be to exclude only file extensions that are
registered in the mime-type repository (some well known file extensions)
but for which no parser is activated, and accept all the other ones.
So that the .foo files will be fetched...
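
Something along these lines, just to illustrate the logic (the class name
and the two lookup structures are placeholders, not the real Nutch API):

import java.util.Map;
import java.util.Set;

// Sketch only: reject a URL when its extension is known to the mime-type
// repository but no activated parser handles the corresponding type;
// accept everything else, so unknown extensions like ".foo" get fetched.
public class KnownUnparseableUrlFilter {

  private final Map<String, String> extensionToMimeType; // e.g. "pdf" -> "application/pdf"
  private final Set<String> parseableMimeTypes;          // types with an activated parser

  public KnownUnparseableUrlFilter(Map<String, String> extensionToMimeType,
                                   Set<String> parseableMimeTypes) {
    this.extensionToMimeType = extensionToMimeType;
    this.parseableMimeTypes = parseableMimeTypes;
  }

  /** Returns the URL to accept it, or null to reject it. */
  public String filter(String url) {
    String path = url;
    int query = path.indexOf('?');
    if (query >= 0) path = path.substring(0, query);   // drop any query string
    int slash = path.lastIndexOf('/');
    int dot = path.lastIndexOf('.');
    if (dot <= slash || dot == path.length() - 1) {
      return url;                                      // no extension: accept
    }
    String extension = path.substring(dot + 1).toLowerCase();
    String mimeType = extensionToMimeType.get(extension);
    if (mimeType == null) {
      return url;                                      // unknown extension: accept
    }
    return parseableMimeTypes.contains(mimeType) ? url : null;
  }
}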

Jérôme

RE: Urlfilter Patch

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Doug,

> 
> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> whether
> > it ends in .foo, .bar or the like.
> 
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that. 

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
> 
> It's not how Nutch works now, but this might be more useful than a 
> super-detailed set of regexes...


I liked Matt's idea of the HEAD request though. I wonder if some benchmarks
on performance of this would be useful, because in some cases (such as
focused crawling, or "non-whole-internet" crawling, such as intranet, etc.),
it would seem that the performance penalty of performing the HEAD to get the
content-type would be acceptable, and worth the cost...

Cheers,
  Chris




Re: Urlfilter Patch

Posted by Doug Cutting <cu...@nutch.org>.
Chris Mattmann wrote:
>   In principle, the mimeType system should give us some guidance on
> determining the appropriate mimeType for the content, regardless of whether
> it ends in .foo, .bar or the like.

Right, but the URL filters run long before we know the mime type, in 
order to try to keep us from fetching lots of stuff we can't process. 
The mime type is not known until we've fetched it.

Doug

Re: Urlfilter Patch

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Doug,


On 12/1/05 1:11 PM, "Doug Cutting" <cu...@nutch.org> wrote:

> Jérôme Charron wrote:
[...]
> 
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
> 
> Doug

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like. I'm not sure if the mime type registry is
there yet, but I know that Jerome was working on a major update that would
help in recognizing these types of situations. Of course, efficiency comes
into play as well, in terms of not slowing down the fetch/parse, but it
would be nice to have a general solution that made use of the information
available in parse-plugins.xml to determine the appropriate set of allowed
extensions in a URLFilter, if possible. It may be a pipe dream, but I'd say
it's worth exploring...

Cheers,
  Chris



______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Urlfilter Patch

Posted by Doug Cutting <cu...@nutch.org>.
Jérôme Charron wrote:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated to each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?

What about a site that develops a content system that has urls that end 
in .foo, which we would exclude, even though they return html?

Doug

Re: Urlfilter Patch

Posted by Ken Krugler <kk...@transpac.com>.
>Suggestion:
>For consistency purposes, and ease of Nutch management, why not filter the
>extensions based on the activated plugins?
>By looking at the mime-types defined in the parse-plugins.xml file and the
>activated plugins, we know which content-types will be parsed.
>So, by getting the file extensions associated to each content-type, we can
>build a list of file extensions to include (other ones will be excluded) in
>the fetch process.

I'd asked a Nutch consultant this exact same question a few months ago.

It does seem odd that there's an implicit dependency between the file 
suffixes found in regex-urlfilter.txt and the enabled plug-ins found 
in nutch-default.xml and nutch-site.xml. What's the point of 
downloading a 100MB .bz2 file if there's nobody available to handle 
it?

It's also odd that there's a nutch-site.xml, but no equivalent for 
regex-urlfilter.txt.

There are cases where some suffixes (like .php) can return any kind of
mime-type content, and other suffixes (like .xml) can mean any number of
things. So I think you'd still want
regex-urlfilter.txt files (both a default and a site version) that 
provide explicit additions/deletions to the list generated from the 
installed and enabled parse-plugins.
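
Purely hypothetical example of such a pair (neither the file names nor the
generation step exist today):

# regex-urlfilter-default.txt -- generated from installed/enabled parse-plugins
+(?i)\.(html|htm|txt)$

# regex-urlfilter-site.txt -- explicit site-level additions/deletions on top
+(?i)\.(php|jsp|cgi)$
-(?i)\.xml$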

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Urlfilter Patch

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Jerome,

 I think that this is a great idea and ensures that there isn't replication
of so-called "management information" across the system. It could be easily
implemented as a utility method, because we have utility Java classes that
represent the ParsePluginList, from which you could get the mimeTypes.
Additionally, we could create a utility method that searches the extension
point list for parsing plugins and returns whether or not they are
activated. Using this information, I believe that the URL filtering would
be a snap.
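
Roughly something like this (the interface and method names below are only
a sketch, not the real API; the actual methods would live wherever
ParsePluginList and the plugin registry are exposed):

import java.util.Set;

// Sketch of the two utility methods described above.
public interface ParsePluginUtils {

  /** All mime types mapped to parsing plugins in parse-plugins.xml. */
  Set<String> getMimeTypes();

  /** True if at least one parsing plugin registered for this mime type is activated. */
  boolean isParserActivated(String mimeType);
}

A URLFilter could then accept a URL's extension only when
isParserActivated(...) returns true for the corresponding mime type.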

 

+1

Cheers,
  Chris



On 12/1/05 12:11 PM, "Jérôme Charron" <je...@gmail.com> wrote:

> Suggestion:
> For consistency purposes, and ease of Nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated to each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Urlfilter Patch

Posted by Jérôme Charron <je...@gmail.com>.
Suggestion:
For consistency purposes, and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated to each content-type, we can
build a list of file extensions to include (other ones will be excluded) in
the fetch process.
No?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Urlfilter Patch

Posted by Ken Krugler <kk...@transpac.com>.
Agreed - looks like this list is too aggressive. A better one would be:

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$

This removes xhtml, xml, php, jsp, py, pl, and cgi.

We've seen php/jsp/py/pl/cgi in our error logs as unparsable, but it
looks like most cases are when the server is misconfigured and winds up
returning the source code, as opposed to the result of executing the code.

-- Ken

>On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
>>  And .xhtml seems like it
>>  would be parsable by the default HTML parser.
>
>Ditto for .xml. It is a valid (though seldom used) xhtml extension.
>
>>  Howie
>>
>>  >From: Doug Cutting <cu...@nutch.org>
>>  >
>>  >Ken Krugler wrote:
>>  >>For what it's worth, below is the filter list we're using for doing an
>>  >>html-centric crawl (no word docs, for example). Using the (?i) means we
>>  >>don't need to have upper & lower-case versions of the suffixes.
>>  >>
>  > >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
>  > >
>>  >This looks like a more complete suffix list.
>>  >
>>  >Should we use this as the default?  By default only html and text parsers
>>  >are enabled, so perhaps that's all we should accept.
>>  >
>  > >Why do you exclude .php urls?  These are simply dynamic pages, no?
>  > >Similarly, .jsp and .py are frequently suffixes that return html.  Are
>>  >there other suffixes we should remove from this list before we make it the
>>  >default exclusion list?
>>  >
>>  >Doug
>>
>>
>>
>--
>Rod Taylor <rb...@sitesell.com>


-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Urlfilter Patch

Posted by Rod Taylor <rb...@sitesell.com>.
On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
> And .xhtml seems like it
> would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

> Howie
> 
> >From: Doug Cutting <cu...@nutch.org>
> >
> >Ken Krugler wrote:
> >>For what it's worth, below is the filter list we're using for doing an 
> >>html-centric crawl (no word docs, for example). Using the (?i) means we 
> >>don't need to have upper & lower-case versions of the suffixes.
> >>
> >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
> >
> >This looks like a more complete suffix list.
> >
> >Should we use this as the default?  By default only html and text parsers 
> >are enabled, so perhaps that's all we should accept.
> >
> >Why do you exclude .php urls?  These are simply dynamic pages, no? 
> >Similarly, .jsp and .py are frequently suffixes that return html.  Are 
> >there other suffixes we should remove from this list before we make it the 
> >default exclusion list?
> >
> >Doug
> 
> 
> 
-- 
Rod Taylor <rb...@sitesell.com>


Re: Urlfilter Patch

Posted by Howie Wang <ho...@hotmail.com>.
.pl files are often just Perl CGI scripts. And .xhtml seems like it
would be parsable by the default HTML parser.

Howie

>From: Doug Cutting <cu...@nutch.org>
>
>Ken Krugler wrote:
>>For what it's worth, below is the filter list we're using for doing an 
>>html-centric crawl (no word docs, for example). Using the (?i) means we 
>>don't need to have upper & lower-case versions of the suffixes.
>>
>>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
>
>This looks like a more complete suffix list.
>
>Should we use this as the default?  By default only html and text parsers 
>are enabled, so perhaps that's all we should accept.
>
>Why do you exclude .php urls?  These are simply dynamic pages, no? 
>Similarly, .jsp and .py are frequently suffixes that return html.  Are 
>there other suffixes we should remove from this list before we make it the 
>default exclusion list?
>
>Doug



Re: Urlfilter Patch

Posted by Doug Cutting <cu...@nutch.org>.
Ken Krugler wrote:
> For what it's worth, below is the filter list we're using for doing an 
> html-centric crawl (no word docs, for example). Using the (?i) means we 
> don't need to have upper & lower-case versions of the suffixes.
> 
> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$ 

This looks like a more complete suffix list.

Should we use this as the default?  By default only html and text 
parsers are enabled, so perhaps that's all we should accept.

Why do you exclude .php urls?  These are simply dynamic pages, no? 
Similarly, .jsp and .py are frequently suffixes that return html.  Are 
there other suffixes we should remove from this list before we make it 
the default exclusion list?

Doug

Re: Urlfilter Patch

Posted by Ken Krugler <kk...@transpac.com>.
>On Mon, 2005-11-28 at 11:44 -0800, Doug Cutting wrote:
>>  Rod Taylor wrote:
>>  > Add a few more extensions which I commonly see and cannot be parsed
>>  > (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
>>
>>  [ ... ]
>>
>>  >  # skip image and other suffixes we can't yet parse
>>  > 
>>--\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>  > > 
>+-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$

For what it's worth, below is the filter list we're using for doing 
an html-centric crawl (no word docs, for example). Using the (?i) 
means we don't need to have upper & lower-case versions of the 
suffixes.

-- Ken

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$


-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Urlfilter Patch

Posted by Rod Taylor <rb...@sitesell.com>.
On Mon, 2005-11-28 at 11:44 -0800, Doug Cutting wrote:
> Rod Taylor wrote:
> > Add a few more extensions which I commonly see and cannot be parsed
> > (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
> 
> [ ... ]
> 
> >  # skip image and other suffixes we can't yet parse
> > --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
> > +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
> 
> Is the '||' intentional, or a typo?  Do you mean to prohibit files 
> ending with just '.'?

It is just a typo.

I rearranged some of the names before submitting and must have done it
then.

-- 
Rod Taylor <rb...@sitesell.com>


Re: Urlfilter Patch

Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> Add a few more extensions which I commonly see and cannot be parsed
> (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.

[ ... ]

>  # skip image and other suffixes we can't yet parse
> --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
> +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$

Is the '||' intentional, or a typo?  Do you mean to prohibit files 
ending with just '.'?

Doug