Posted to droids-dev@incubator.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2010/11/30 19:40:23 UTC

[jira] Created: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Several defects in robots exclusion protocol (robots.txt) implementation
------------------------------------------------------------------------

                 Key: DROIDS-109
                 URL: https://issues.apache.org/jira/browse/DROIDS-109
             Project: Droids
          Issue Type: Bug
          Components: core, norobots
            Reporter: Fuad Efendi


The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html is at least 12 years out of date.

1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...
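
For illustration, here is roughly what fixing defects 2-4 implies for path decoding: decode %xx octets as UTF-8, but leave %2F alone (a hypothetical helper, not the current Droids code):

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public final class PathDecoder {

    // Decodes %xx escapes as UTF-8 octets, but keeps %2F intact so an
    // encoded slash is never confused with a real path separator.
    public static String decodeForMatching(String rawPath) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '%' && i + 2 < rawPath.length()) {
                String hex = rawPath.substring(i + 1, i + 3);
                if (hex.equalsIgnoreCase("2F")) {
                    out.write('%');              // keep the encoded slash
                    out.write(hex.charAt(0));
                    out.write(hex.charAt(1));
                } else {
                    out.write(Integer.parseInt(hex, 16));
                }
                i += 2;
            } else {
                out.write(c);
            }
        }
        // The collected octets form a UTF-8 sequence, not US-ASCII.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
{code}

For example, decodeForMatching("/a%2Fb/caf%C3%A9") yields "/a%2Fb/café": the encoded slash survives, the other UTF-8 octets are decoded.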


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html






[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968912#action_12968912 ] 

Paul Rogalinski commented on DROIDS-109:
----------------------------------------

@Fuad:

Can you design some tests for those issues? I understand that designing (J)Unit tests for this kind of problem is very time-consuming, so a bunch of folders, each representing one test scenario together with a description of the expected outcome, would be just fine to start with.

Currently I am working on a different part of Droids, but I *will* have to deal with robots.txt pretty soon, and I would be more than happy to commit a drop-in replacement for the current implementation addressing those issues.
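
A harness for that folder-per-scenario layout could be as simple as the following (all names invented for illustration; nothing here exists in Droids yet). Each directory under the root holds a robots.txt plus an expected.txt listing one "ALLOW <url>" or "DISALLOW <url>" per line:

{code:java}
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class RobotsScenarioRunner {

    public static void main(String[] args) throws Exception {
        File root = new File("src/test/resources/robots-scenarios");
        File[] scenarios = root.listFiles(File::isDirectory);
        if (scenarios == null) {
            return; // no scenario folders yet
        }
        for (File scenario : scenarios) {
            byte[] robotsTxt = Files.readAllBytes(
                    new File(scenario, "robots.txt").toPath());
            for (String line : Files.readAllLines(
                    new File(scenario, "expected.txt").toPath(),
                    StandardCharsets.UTF_8)) {
                String[] parts = line.split("\\s+", 2);
                boolean shouldAllow = parts[0].equals("ALLOW");
                // Feed robotsTxt to the parser under test here and assert
                // that parts[1] is allowed or disallowed as expected.
                System.out.printf("%s: expect %s for %s%n",
                        scenario.getName(),
                        shouldAllow ? "ALLOW" : "DISALLOW", parts[1]);
            }
        }
    }
}
{code}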




[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 
The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html is at least 12 years out of date.

1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt



  was:
The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html is at least 12 years out of date.

1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html





We also need to deal with HTTP response headers: for instance, to decode robots.txt into the proper charset, to handle the expiration header, etc.
I should modify the ContentLoader interface, then the implementations, and subsequently break the whole framework :) let's think...
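
For example, once the headers are available, decoding could look roughly like this (a sketch only; where the charset name comes from depends on the ContentLoader refactoring):

{code:java}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class RobotsTxtDecoder {

    // Decode the fetched bytes using the charset announced in the HTTP
    // Content-Type header (falling back to UTF-8), after stripping a
    // leading UTF-8 BOM (0xEF 0xBB 0xBF).
    public static String decode(byte[] body, String charsetName) {
        Charset cs = charsetName != null
                ? Charset.forName(charsetName)
                : StandardCharsets.UTF_8;
        int offset = 0;
        if (body.length >= 3 && (body[0] & 0xFF) == 0xEF
                && (body[1] & 0xFF) == 0xBB && (body[2] & 0xFF) == 0xBF) {
            offset = 3; // skip the BOM
        }
        return new String(body, offset, body.length - offset, cs);
    }
}
{code}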




[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

Recent update from Google:
*http://code.google.com/web/controlcrawlindex/*

  was:
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

Recent update from Google:
{b}http://code.google.com/web/controlcrawlindex/{b}





[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974863#action_12974863 ] 

Otis Gospodnetic commented on DROIDS-109:
-----------------------------------------

Isn't this sort of stuff dealt with in the Crawler Commons project? See http://code.google.com/p/crawler-commons/

Shouldn't Droids make use of the effort and functionality in that project? (N.B. I don't know what the state of the project is or what functionality it actually provides... I just had a quick look and don't see anything in the repo there about robots.txt handling, but I bet Ken Krugler could tell us about the plans, timelines, and such.)





[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965466#action_12965466 ] 

Fuad Efendi commented on DROIDS-109:
------------------------------------

1. I need to introduce an "Entity" carrying the HTTP headers, expiration settings, last retrieval date, response code, exception message, etc.; and I need to properly decode the byte array representing robots.txt.
2. I need to modify some interfaces so that droids-norobots can use a (refactored) HttpContentEntity.
And then we have a cyclic dependency loop...

It would be better to unite "core" and "norobots" into the same package... otherwise we need to move some interfaces from "core" into "norobots" (which doesn't seem nice).
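
Roughly, the entity I have in mind would look like this (purely illustrative; no such interface exists in Droids yet):

{code:java}
import java.util.Date;
import java.util.Map;

// Raw bytes plus metadata instead of an InputStream, so charset and
// caching decisions can be made after the fetch has completed.
public interface HttpContentEntity {
    byte[] getContent();
    Map<String, String> getHeaders();
    int getStatusCode();
    Date getLastFetched();
    Date getExpires();            // derived from Expires / Cache-Control
    String getExceptionMessage(); // null if the fetch succeeded
}
{code}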




[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997500#comment-12997500 ] 

Paul Rogalinski commented on DROIDS-109:
----------------------------------------

I've successfully ported Bixo's implementation over to "my version" of Droids. Why no patch? Two issues: a) my copy of Droids has diverged too far from the current trunk, and b) the patch would IMHO change too much (DroidsHttpClient and Protocol have been altered, for instance). Anybody with commit permissions up to the task?



        

[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

        Fix Version/s: Graduating from the Incubator
          Description: 
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.


  was:

1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.


    Affects Version/s: Graduating from the Incubator




[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975389#action_12975389 ] 

Fuad Efendi commented on DROIDS-109:
------------------------------------

And another project, hosted at Google by Google, consisting only of documentation:
http://code.google.com/web/controlcrawlindex/
For instance, it documents the X-Robots-Tag HTTP header, Punycode (the ASCII encoding of Unicode domain names), etc.
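
Both are easy to consume from standard Java; for illustration (nothing Droids-specific, just java.net):

{code:java}
import java.net.IDN;
import java.util.List;

public final class CrawlIndexHints {

    // True if any X-Robots-Tag response header value asks us not to index.
    public static boolean noindex(List<String> xRobotsTagValues) {
        for (String value : xRobotsTagValues) {
            if (value.toLowerCase().contains("noindex")) {
                return true;
            }
        }
        return false;
    }

    // Punycode (ASCII) form of an internationalized host name,
    // e.g. "bücher.de" becomes "xn--bcher-kva.de".
    public static String asciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }
}
{code}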






[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

Recent update from Google:
{b}http://code.google.com/web/controlcrawlindex/{b}

  was:
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.






[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

*Update from Google:*
http://code.google.com/web/controlcrawlindex/

  was:
1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() already returns a decoded string, yet the method body then calls URLDecoder.decode(path, US_ASCII) on it
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemap directives are not supported
8. Crawl rate (Crawl-delay) is not supported
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html, bytes 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch and Bixo haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

Recent update from Google:
*http://code.google.com/web/controlcrawlindex/*





[jira] Issue Comment Edited: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965439#action_12965439 ] 

Fuad Efendi edited comment on DROIDS-109 at 11/30/10 4:06 PM:
--------------------------------------------------------------

http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
{quote}The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.{quote}

Koster doesn't say anything about the encoding of robots.txt itself (HTTP response headers); he only mentions HTTP cache control in section 3.4...

Logically, we need to decode the path (excluding %2F) before comparing it to a rule; and the decoded path may contain any Unicode character.

It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially documented anywhere yet (except http://nikitathespider.com/python/rerp/).
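
For illustration, Koster's matching rule reduces to an octet-wise prefix comparison, assuming both sides have already been decoded as argued above with %2F left intact (a sketch, not the current NoRobotClient logic):

{code:java}
import java.nio.charset.StandardCharsets;

public final class KosterMatch {

    // The rule matches iff its end is reached before any octet differs,
    // i.e. the rule's octets are a prefix of the URL path's octets.
    public static boolean matches(String rulePath, String urlPath) {
        byte[] rule = rulePath.getBytes(StandardCharsets.UTF_8);
        byte[] path = urlPath.getBytes(StandardCharsets.UTF_8);
        if (rule.length > path.length) {
            return false; // path ended before the rule was consumed
        }
        for (int i = 0; i < rule.length; i++) {
            if (rule[i] != path[i]) {
                return false; // octets differ before the rule's end
            }
        }
        return true;
    }
}
{code}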





      was (Author: funtick):
    
http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
??The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.??

Koster doesn't say anything about the encoding of robots.txt itself. However, we need to decode the path (excluding %2F) before comparing it to a rule; and the decoded path may contain any Unicode character.

It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially documented anywhere yet (except http://nikitathespider.com/python/rerp/).




  



[jira] Issue Comment Edited: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965439#action_12965439 ] 

Fuad Efendi edited comment on DROIDS-109 at 11/30/10 4:12 PM:
--------------------------------------------------------------

http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
{quote}The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.{quote}

Koster doesn't say anything about the encoding of robots.txt itself (HTTP response headers); he only mentions HTTP cache control in section 3.4...

Logically, we need to decode the path (excluding %2F) before comparing it to a rule; and the decoded path may contain any Unicode character.

It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially documented anywhere yet (except http://nikitathespider.com/python/rerp/).

Also, don't forget that "path" in this unofficial document (1996) really means whatever comes after "protocol + // + host + port"... for instance:
/query;sessionID=123#My%2fAnchor?abc=123
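
A client should therefore match against the raw, undecoded path plus query rather than URI.getPath(); for example (a sketch, shown with a standard-order URI where the fragment comes last):

{code:java}
import java.net.URI;

public final class RawTarget {

    // Everything after scheme + "//" + host + port, undecoded, so %2f
    // survives and the query part stays available for matching; the
    // fragment is client-side only and is dropped.
    public static String of(URI uri) {
        String target = uri.getRawPath();
        if (uri.getRawQuery() != null) {
            target += "?" + uri.getRawQuery();
        }
        return target;
    }

    public static void main(String[] args) {
        URI u = URI.create(
                "http://example.com/query;sessionID=123?abc=123#My%2fAnchor");
        System.out.println(of(u)); // prints /query;sessionID=123?abc=123
    }
}
{code}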




      was (Author: funtick):
    http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
{quote}The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.{quote}

Koster doesn't say anything about the encoding of robots.txt itself (HTTP response headers); he only mentions HTTP cache control in section 3.4...

Logically, we need to decode the path (excluding %2F) before comparing it to a rule; and the decoded path may contain any Unicode character.

It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially documented anywhere yet (except http://nikitathespider.com/python/rerp/).




  



[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975338#action_12975338 ] 

Ken Krugler commented on DROIDS-109:
------------------------------------

I'd separately emailed Fuad about crawler-commons, and also pointed him at the current robots.txt parsing code in Bixo. I'd taken all of the code/tests I could find from Nutch, Droids, Heritrix and one other Java-based crawler, and tried to come up with parsing code that passed all tests. Then I ran it against a 2.3M domain crawl, and tried to handle all of the common errors I encountered (typos, missing ':', etc).

The big remaining issue is handling Google-esque URL patterns.
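
For reference, those Google-esque patterns reduce to '*' meaning "any run of characters" and a trailing '$' acting as an end anchor; one possible translation to regexes (a sketch, not Bixo's actual code):

{code:java}
import java.util.regex.Pattern;

public final class WildcardRule {

    // '*' matches any run of characters, a trailing '$' anchors the end,
    // everything else is literal; an unanchored rule is a prefix match.
    public static Pattern compile(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        StringBuilder re = new StringBuilder();
        for (char c : body.toCharArray()) {
            re.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        if (!anchored) {
            re.append(".*");
        }
        return Pattern.compile(re.toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("/private*/data$");
        System.out.println(p.matcher("/private123/data").matches());   // true
        System.out.println(p.matcher("/private123/data/x").matches()); // false
    }
}
{code}

Rule selection would then prefer the longest matching pattern, per item 5 of this issue.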





[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975554#action_12975554 ] 

Fuad Efendi commented on DROIDS-109:
------------------------------------

@Ken:
Bixo's robots parser is great, especially the directive spell-checking and the handling of many flavors of "new line" characters (which I really did encounter a few years ago and reported to Nutch).

@Paul:
Ken suggested the same thing, designing test cases; I am simply very limited in time... whenever I feel I need to share findings, I do share...

It's much easier to improve Bixo or crawler-commons than to completely redesign Droids (to implement HTTP header pre-processing in Droids, I'd need to avoid using InputStream in JavaBean classes and use byte arrays plus metadata instead - it's easier to rewrite Droids from scratch than to submit a patch).





[jira] [Updated] (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Richard Frovarp (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-109:
-----------------------------------

    Affects Version/s:     (was: Graduating from the Incubator)
                       0.0.2
        Fix Version/s:     (was: Graduating from the Incubator)
    


[jira] Updated: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 

1. Googlebot and many others support query-part rules; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before applying a rule
3. Double decoding is performed in NoRobotClient.isUrlAllowed(URI uri): baseURI.getPath() already returns a decoded string, and then we call URLDecoder.decode(path, US_ASCII) on it again
4. URLDecoder.decode(path, US_ASCII) - UTF-8 must be used instead!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemaps
8. Crawl rate
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html; bytes: 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch & BIXO haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

The document referenced (even by Google!) at http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
Proper reference: http://www.robotstxt.org/norobots-rfc.txt (1996).
We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.


  was:
The document referenced (even by Google!) at http://www.robotstxt.org/wc/norobots-rfc.html is at least 12 years outdated.

1. Googlebot and many others support query-part rules; Droids currently matches only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it shouldn't be decoded before applying a rule
3. Double decoding is performed in NoRobotClient.isUrlAllowed(URI uri): baseURI.getPath() already returns a decoded string, and then we call URLDecoder.decode(path, US_ASCII) on it again
4. URLDecoder.decode(path, US_ASCII) - UTF-8 must be used instead!
5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemaps
8. Crawl rate
9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html; bytes: 0xEF 0xBB 0xBF)

and most probably many more defects (Nutch & BIXO haven't implemented it in full yet). I am working on it right now...


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt







[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976402#action_12976402 ] 

Otis Gospodnetic commented on DROIDS-109:
-----------------------------------------

Fuad opened an issue, but won't be providing a patch.
Should we close this as Won't Fix until this starts itching somebody enough to submit a patch?




[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965439#action_12965439 ] 

Fuad Efendi commented on DROIDS-109:
------------------------------------


http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
"The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered."

Koster doesn't say anything about the encoding of robots.txt itself. However, we need to decode (excluding %2F) before comparing against a rule, and the decoded path may contain any Unicode character.

This naturally means that webmasters are allowed to use any charset in robots.txt, and we must analyze the HTTP headers and decode the stream accordingly, even though no official document mentions this yet (except http://nikitathespider.com/python/rerp/). A decoding sketch follows below.
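As an illustration of the quoted rule, a minimal sketch (hypothetical code, not the NoRobotClient method) that percent-decodes a path exactly once, leaves %2F encoded, and interprets the decoded octets as UTF-8; malformed %xx escapes are not handled here:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public final class PathDecoder {

    /** Percent-decodes a URI path, leaving %2F encoded ("/" separates segments). */
    static String decodeExceptSlash(String path) throws UnsupportedEncodingException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                int octet = Integer.parseInt(path.substring(i + 1, i + 3), 16);
                if (octet == 0x2F) {
                    out.write('%'); out.write('2'); out.write('F');  // keep %2F as-is
                } else {
                    out.write(octet);  // may be one byte of a multi-byte UTF-8 char
                }
                i += 2;
            } else {
                out.write(c);  // URI paths are ASCII outside %xx escapes
            }
        }
        // Interpret the decoded octets as UTF-8, not US-ASCII (item 4).
        return new String(out.toByteArray(), "UTF-8");
    }
}

For example, decodeExceptSlash("/a%2Fb/%D0%B6") returns "/a%2Fb/ж": the Cyrillic letter is decoded, the data slash is not.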








[jira] Commented: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975156#action_12975156 ] 

Thorsten Scherler commented on DROIDS-109:
------------------------------------------

Actually, I am subscribed to the crawler-commons mailing list and was there when the project was created. There is not much traffic in that project; it was created to provide some independent common ground between Nutch and Droids (basically those two at least, that was my impression) and some others.




[jira] Issue Comment Edited: (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975389#action_12975389 ] 

Fuad Efendi edited comment on DROIDS-109 at 12/27/10 6:55 PM:
--------------------------------------------------------------

And here is another project hosted at Google, by Google - documentation only:
http://code.google.com/web/controlcrawlindex/
For instance, it documents the X-Robots-Tag HTTP header, Punycode (the ASCII encoding of Unicode domain names), etc.
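As a rough illustration, a hypothetical check for such a header (the real directive grammar, e.g. "unavailable_after: <date>", is richer than this sketch handles):

import java.util.List;
import java.util.Locale;
import java.util.Map;

public final class XRobotsTag {

    /**
     * True if any X-Robots-Tag value carries the given directive, e.g.
     * "noindex" in "X-Robots-Tag: noindex, nofollow" or in the bot-scoped
     * form "X-Robots-Tag: googlebot: noindex".
     */
    static boolean hasDirective(Map<String, List<String>> headers,
                                String userAgent, String directive) {
        List<String> values = headers.get("X-Robots-Tag");
        if (values == null) {
            return false;
        }
        for (String value : values) {
            String v = value.toLowerCase(Locale.ENGLISH);
            int colon = v.indexOf(':');
            if (colon >= 0) {
                // Treat "name:" as a bot scope; a full parser must also
                // recognize directives with arguments such as "unavailable_after:".
                if (!v.substring(0, colon).trim()
                        .equals(userAgent.toLowerCase(Locale.ENGLISH))) {
                    continue;  // scoped to a different bot
                }
                v = v.substring(colon + 1);
            }
            for (String token : v.split(",")) {
                if (token.trim().equals(directive)) {
                    return true;
                }
            }
        }
        return false;
    }
}

The headers map can come from, e.g., HttpURLConnection.getHeaderFields(), assuming the header name's casing matches.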



      was (Author: funtick):
    And another project hosted at Google by Google, just documentation:
http://code.google.com/web/controlcrawlindex/
For instance, it documents X-Robots-Tags HTTP headers, Punicode (unicode encoding for domain names), and etc.


  



[jira] [Commented] (DROIDS-109) Several defects in robots exclusion protocol (robots.txt) implementation

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043125#comment-13043125 ] 

Fuad Efendi commented on DROIDS-109:
------------------------------------

I can work on it now, together with the BIXO team and crawler-commons.
The problem is the InputStreams, but I'll try to minimize changes... I'll start with test cases... thanks
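A hypothetical starting point for those test cases (JUnit 4), written against the RobotsRuleMatcher sketch posted earlier in this thread rather than any existing Droids class:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class RobotsRulesTest {

    @Test
    public void queryPartIsMatched() {          // item 1
        RobotsRuleMatcher m = new RobotsRuleMatcher();
        m.add("/search?q=", false);             // Disallow: /search?q=
        assertFalse(m.isAllowed("/search?q=droids"));
        assertTrue(m.isAllowed("/search"));
    }

    @Test
    public void encodedSlashIsNotDecoded() {    // item 2
        RobotsRuleMatcher m = new RobotsRuleMatcher();
        m.add("/a%2Fb", false);                 // %2F is a data slash, not "/"
        assertFalse(m.isAllowed("/a%2Fb"));
        assertTrue(m.isAllowed("/a/b"));        // a real slash is different
    }

    @Test
    public void longestMatchWins() {            // item 5
        RobotsRuleMatcher m = new RobotsRuleMatcher();
        m.add("/shared", false);                // Disallow: /shared
        m.add("/shared/public", true);          // Allow:    /shared/public
        assertTrue(m.isAllowed("/shared/public/index.html"));
        assertFalse(m.isAllowed("/shared/private.html"));
    }

    @Test
    public void wildcardsAreRecognized() {      // item 6
        RobotsRuleMatcher m = new RobotsRuleMatcher();
        m.add("/*.pdf$", false);                // Disallow: /*.pdf$
        assertFalse(m.isAllowed("/docs/manual.pdf"));
        assertTrue(m.isAllowed("/docs/manual.pdf.html"));
    }
}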

