Posted to dev@nutch.apache.org by Scott Gonyea <sc...@aitrus.org> on 2011/03/04 02:40:09 UTC

Re: Nutch Parser annoyingly faulty

Has anyone looked into this?  This is especially a problem when folks
like Juergen are customers and, quite rightfully, raise hell.  I
wasn't aware of this, since Nutch is a software metaphor for a
firehose.  But what I have noticed is that the URL parser is really,
really terrible.  Expletive-worthy.

The problem I am experiencing is the lack of subdomain support.
Dumping thousands of regexes into a flat file is a terrible hack.  More
than that, pushing metadata down through a given site becomes
unreliable.  If one site links to another, and that site's links are
crawled, your metadata is now unreliable as well.

Etc.  I don't want to come across as whiny, but I just did.  I really
think Nutch needs to hunker down on tests.  I'm guilty of not caring
about this myself, but that's because testing Java is pretty painful
compared to BDD tools like RSpec:

http://www.codecommit.com/blog/java/the-brilliance-of-bdd

Scott

On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht
<ju...@shakodo.com> wrote:
> Hi Nutch Team,
>
> before I permanently reject Nutch from all my sites, I'd better tell
> you why: your URL parser is extremely faulty and creates a lot of
> trouble.
>
> Here is an example, if you have a link on a page, say:
>
> http://www.somesite/somepage/
>
> and the link in HTML looks like:
>
> <a href=".">This Page</a>
>
> the parser should identify that the "." (dot) refers
> to this URL:
>
> http://www.somesite/somepage/
>
> and not to:
>
> http://www.somesite/somepage/.
>
> Every single browser resolves this correctly, so why not Nutch?
>
> Why is this important? Many newer sites no longer use the traditional
> mapping of URLs to directories, but instead have controllers,
> actions, parameters, etc. encoded in the URL.
>
> These get split on a separator, which is often "/" (slash), so a URL
> with a trailing dot requests a different resource than one without
> the dot. Ignoring the dot in the backend to cope with Nutch's faulty
> parser would create at least two URLs serving the same content,
> which in turn might affect your Google ranking.
>
> Also, Nutch parses "compressed" JavaScript files, which are all
> written in one long line, then somehow takes parts of the code and
> appends them to the URL, creating a huge number of 404s on the
> server side.
>
> For example, say you have a URL to a JavaScript file like this:
>
>  http://www.somesite/javascript/foo.js
>
> Nutch parses this and then accesses random (?) new URLs which look like:
>
> http://www.somesite/javascript/someFunction();
>
> etc etc.
>
> Please, please, please fix Nutch!
>
> Thanks,
>
> Juergen
> --
> Shakodo - The road to profitable photography: http://www.shakodo.com/
>

Re: Nutch Parser annoyingly faulty

Posted by Juergen Specht <ju...@shakodo.com>.
Hi Julien,

On 3/4/11 7:09 PM, Julien Nioche wrote:
> Thanks for reporting the problem, Juergen, and sorry that you felt you
> were being ignored. The few active developers Nutch has contribute
> during their spare time; the reason you did not get any comments
> on this is that no one had an instant answer or time to investigate
> in more detail. You definitely raised an important issue which is
> worth investigating.

thanks for taking the time to reply and checking my settings!

> To answer your first email: the JavaScript parser is notoriously
> noisy and generates all sorts of monstrosities. It used to be
> activated by default, but this won't be the case as of the
> forthcoming 1.3 release.

I see. Monstrosities describes it quite well :)

> I have not been able to reproduce the issue with the dot though. [...]
> Any particular URL on your site that you had this problem with?

No, it's not particular URLs; it happens all over the place. However,
I just checked and it seems to happen with Nutch 0.9 and 1.0.
Here are some examples:

216.24.131.152 - - [26/Feb/2011:00:53:44 +0900] "GET /assignments/tags/advertisement/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:55:03 +0900] "GET /assignments/tags/assignments_design/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:55:56 +0900] "GET /assignments/tags/assignments_commercial-photography/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:56:19 +0900] "GET /assignments/tags/apartment_rental/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:57:09 +0900] "GET /assignments/tags/assignments_church/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:57:26 +0900] "GET /assignments/tags/assignments_corporate/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:57:44 +0900] "GET /assignments/tags/assignments_cd-cover/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:58:16 +0900] "GET /assignments/tags/amateur_assignments/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:58:18 +0900] "GET /assignments/tags/assignments_event/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
216.24.131.152 - - [26/Feb/2011:00:59:16 +0900] "GET /assignments/tags/agent/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"

> By default, Nutch does respect robots.txt and the community as a
> whole encourages server politeness and reasonable use; however, we
> can't prevent people from using ridiculous settings (e.g. a high
> number of threads per host, a low time gap between calls) or
> modifying the code to bypass the robots checking (see my comment
> below).

Understood.

> I have checked your robots.txt and it looks correct. I tried parsing
> http://www.shakodo.com with the user-agents you specified; Nutch
> fully respected robots.txt and the content was not fetched.

Thanks a lot for the confirmation!

> That's indeed a possibility

And it's now also confirmed. I might add another "Disallow: /badrobot/"
trap to my robots.txt to see if I get more violations.
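
Something along these lines (the path is only an example), plus a
hidden link to it somewhere on the site so crawlers can discover it:

  User-agent: *
  Disallow: /badrobot/

No legitimate visitor or well-behaved crawler ever requests that path,
so any hit on /badrobot/ in the access log is a confirmed robots.txt
violation.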

>> Doesn't this violate your license?
> Not as far as I know. The Apache license allows people to modify the
> code; most people do that for positive reasons, and unfortunately we
> can't prevent people from bypassing the robots check.

Too bad, but you can use a hammer to put a nail into the wall (useful)
or to put a nail into somebody's head (not so useful - with exceptions).

> Another option is to see whether the companies you want to block
> consistently use the same IP range and configure your servers so that
> they prevent access to these IPs.  You could file a complaint with the
> company hosting the crawl; I know that Amazon is pretty responsive
> with EC2 and would take measures to make sure their users do the
> right thing.

I have already blocked most of the IPs I could find, and I reported
them to their ISPs, but they seem to have better arguments (i.e. they
pay their ISPs) than I do.

Anyway, thanks a lot for checking and coming back to me with info,
very much appreciated! I will not add Nutch 1.3 to my "disallow" rule
set! :)

Thanks,

Juergen
-- 
Shakodo - The road to profitable photography: http://www.shakodo.com/

Re: Nutch Parser annoyingly faulty

Posted by Julien Nioche <li...@gmail.com>.
Hi Juergen,


> Since I wrote this email - which I thought got ignored by the
> Nutch developers -


Thanks for reporting the problem, Juergen, and sorry that you felt you were
being ignored. The few active developers Nutch has contribute during their
spare time; the reason you did not get any comments on this is that no one
had an instant answer or time to investigate in more detail. You definitely
raised an important issue which is worth investigating.

To answer your first email: the JavaScript parser is notoriously noisy and
generates all sorts of monstrosities. It used to be activated by default,
but this won't be the case as of the forthcoming 1.3 release.
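
To give a rough idea of what goes wrong (a simplified, hypothetical
illustration, not the parse-js plugin's actual code): a pattern meant
to pull URL-like strings out of one long minified line will happily
match fragments of code, and each match then gets resolved against the
.js file's URL:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsLinkNoise {
    public static void main(String[] args) {
        // one line of "compressed" javascript, as in Juergen's example
        String js = "function load(){fetchData();render('/img/x.png');someFunction();}";
        // naive "looks like a path or a call" pattern -- purely illustrative
        Pattern p = Pattern.compile("[\\w./]+\\(\\);?|/[\\w./-]+");
        Matcher m = p.matcher(js);
        while (m.find()) {
            // prints load(), fetchData();, /img/x.png, someFunction(); --
            // resolved against http://www.somesite/javascript/foo.js these
            // become exactly the kind of bogus outlinks that end up as 404s
            System.out.println(m.group());
        }
    }
}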

I have not been able to reproduce the issue with the dot though. I put this

<html>
<a href=".">This Page</a>
</html>

on our server: http://www.digitalpebble.com/dummy.html

ran: ./nutch org.apache.nutch.parse.ParserChecker
http://www.digitalpebble.com/dummy.html

and got

Outlinks: 1
  outlink: toUrl: http://www.digitalpebble.com/ anchor: This Page

as expected.
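
For what it's worth, the JDK's own relative-reference resolution gives
the same answer, so this is simply what standard resolution of "."
should produce (a standalone sketch, reusing the host from Juergen's
example):

import java.net.URI;

public class DotResolution {
    public static void main(String[] args) {
        URI base = URI.create("http://www.somesite/somepage/");
        // "." resolves to the enclosing directory, not to a literal trailing dot
        System.out.println(base.resolve("."));  // http://www.somesite/somepage/
    }
}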

Any particular URL on your site that you had this problem with?



> I am getting bombarded on my server by two especially
> annoying and unresponsive companies which use Nutch. The companies
> (and Nutch) are both blocked by my robots.txt file, see:
>
> http://www.shakodo.com/robots.txt
>

> but while they both access this file a couple of times
> per day, they ignore it completely.
> The company http://www.lijit.com/ called me an "idiot" when I
> complained about their faulty configuration, and the other
> company http://www.comodo.com/ ignored every complaint.
>

By default, Nutch does respect robots.txt and the community as a whole
encourages server politeness and reasonable use; however, we can't prevent
people from using ridiculous settings (e.g. a high number of threads per
host, a low time gap between calls) or modifying the code to bypass the
robots checking (see my comment below).


>
> Can you please check whether my robots.txt file has the correct
> syntax and whether I am rejecting Nutch correctly in general, or can
> you please help me fix the syntax so that Nutch-powered crawlers
> don't access our server(s) anymore?


I have checked your robots.txt and it looks correct. I tried parsing
http://www.shakodo.com with the user-agents you specified; Nutch fully
respected robots.txt and the content was not fetched.



> If the syntax is in fact
> correct, then I must assume that at least these two companies
> altered the source to actively ignore the robots.txt rules.
>

That's indeed a possibility


>
> Doesn't this violate your license?
>

Not as far as I know. The Apache license allows people to modify the code;
most people do that for positive reasons, and unfortunately we can't prevent
people from bypassing the robots check.


>
> Help is appreciated!


Another option is to see whether the companies you want to block
consistently use the same IP range and configure your servers so that they
prevent access to these IPs.  You could file a complaint with the company
hosting the crawl; I know that Amazon is pretty responsive with EC2 and
would take measures to make sure their users do the right thing.

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Nutch Parser annoyingly faulty

Posted by Juergen Specht <ju...@shakodo.com>.
Thanks Scott!

Since I wrote this email - which I thought got ignored by the
Nutch developers - I am getting bombarded on my server by two especially
annoying and unresponsive companies which use Nutch. The companies
(and Nutch) are both blocked by my robots.txt file, see:

http://www.shakodo.com/robots.txt

but while they both access this file a couple of times
per day, they ignore it completely.
The company http://www.lijit.com/ called me an "idiot" when I
complained about their faulty configuration, and the other
company http://www.comodo.com/ ignored every complaint.

Can you please check whether my robots.txt file has the correct
syntax and whether I am rejecting Nutch correctly in general, or can
you please help me fix the syntax so that Nutch-powered crawlers
don't access our server(s) anymore? If the syntax is in fact
correct, then I must assume that at least these two companies
altered the source to actively ignore the robots.txt rules.

Doesn't this violate your license?

Help is appreciated!

Juergen
-- 
Shakodo - The road to profitable photography: http://www.shakodo.com/


On 3/4/11 10:40 AM, Scott Gonyea wrote:
> Has anyone looked into this?  This is especially a problem when folks
> like Juergen are customers and, quite rightfully, raise hell.  I
> wasn't aware of this, since Nutch is a software metaphor for a
> firehose.  But what I have noticed is that the URL parser is really,
> really terrible.  Expletive-worthy.
>
> The problem I am experiencing is the lack of subdomain support.
> Dumping thousands of regexes into a flat file is a terrible hack.  More
> than that, pushing metadata down through a given site becomes
> unreliable.  If one site links to another, and that site's links are
> crawled, your metadata is now unreliable as well.
>
> Etc.  I don't want to come across as whiny, but I just did.  I really
> think Nutch needs to hunker down on tests.  I'm guilty of not caring
> about this myself, but that's because testing Java is pretty painful
> compared to BDD tools like RSpec:
>
> http://www.codecommit.com/blog/java/the-brilliance-of-bdd
>
> Scott
>
> On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht
> <ju...@shakodo.com>  wrote:
>> Hi Nutch Team,
>>
>> before I permanently reject Nutch from all my sites, I'd better tell
>> you why: your URL parser is extremely faulty and creates a lot of
>> trouble.
>>
>> Here is an example, if you have a link on a page, say:
>>
>> http://www.somesite/somepage/
>>
>> and the link in HTML looks like:
>>
>> <a href=".">This Page</a>
>>
>> the parser should identify that the "." (dot) refers
>> to this URL:
>>
>> http://www.somesite/somepage/
>>
>> and not to:
>>
>> http://www.somesite/somepage/.
>>
>> Every single browser resolves this correctly, so why not Nutch?
>>
>> Why is this important? Many newer sites no longer use the traditional
>> mapping of URLs to directories, but instead have controllers,
>> actions, parameters, etc. encoded in the URL.
>>
>> These get split on a separator, which is often "/" (slash), so a URL
>> with a trailing dot requests a different resource than one without
>> the dot. Ignoring the dot in the backend to cope with Nutch's faulty
>> parser would create at least two URLs serving the same content,
>> which in turn might affect your Google ranking.
>>
>> Also, Nutch parses "compressed" JavaScript files, which are all
>> written in one long line, then somehow takes parts of the code and
>> appends them to the URL, creating a huge number of 404s on the
>> server side.
>>
>> For example, say you have a URL to a JavaScript file like this:
>>
>>   http://www.somesite/javascript/foo.js
>>
>> Nutch parses this and then accesses random (?) new URLs which look like:
>>
>> http://www.somesite/javascript/someFunction();
>>
>> etc etc.
>>
>> Please, please, please fix Nutch!
>>
>> Thanks,
>>
>> Juergen
>> --
>> Shakodo - The road to profitable photography: http://www.shakodo.com/
>>

