Posted to droids-dev@incubator.apache.org by "Paul Rogalinski (JIRA)" <ji...@apache.org> on 2010/11/23 13:32:13 UTC

[jira] Created: (DROIDS-105) missing caching for robots.txt

missing caching for robots.txt
------------------------------

                 Key: DROIDS-105
                 URL: https://issues.apache.org/jira/browse/DROIDS-105
             Project: Droids
          Issue Type: Improvement
          Components: core
            Reporter: Paul Rogalinski
         Attachments: CachingContentLoader.java

The current implementation of the HttpClient does not cache any requests for the robots.txt file. When using the CrawlingWorker, this results in two requests to robots.txt (HEAD + GET) per crawled URL, so when crawling 3 URLs the target server gets 6 requests for robots.txt.

Unfortunately, the contentLoader is made final in HttpProtocol, so there is no way to replace it with a caching loader like the one you'll find in the attachment.
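
To make the idea concrete, here is a minimal sketch of such a loader (it is not the attached class; the class name is made up, and it assumes a ContentLoader interface with an InputStream load(URI) method, which is what HttpClientContentLoader provides). It wraps any existing loader and serves repeated URIs, such as robots.txt, from memory:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class CachingContentLoaderSketch implements ContentLoader {

  private final ContentLoader delegate;
  private final Map<URI, byte[]> cache = new HashMap<URI, byte[]>();

  public CachingContentLoaderSketch(ContentLoader delegate) {
    this.delegate = delegate;
  }

  public synchronized InputStream load(URI uri) throws IOException {
    byte[] body = cache.get(uri);
    if (body == null) {
      // first request for this URI goes over the wire, later ones come from memory
      InputStream in = delegate.load(uri);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      for (int n; (n = in.read(chunk)) != -1; ) {
        out.write(chunk, 0, n);
      }
      in.close();
      body = out.toByteArray();
      cache.put(uri, body);
    }
    return new ByteArrayInputStream(body);
  }
}

A bounded (LRU) or per-host variant of the same idea is discussed further down in this thread.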

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935493#action_12935493 ] 

Paul Rogalinski commented on DROIDS-105:
----------------------------------------

@Javier

Then use the second Java class attached; it does exactly that. It still uses Commons Collections though, which must have slipped through my change detection somehow, sorry about that. It is not easy running three copies of Droids in different versions to get clean patches out of the build system :)

Oh, and the second solution needs to be adapted to the Droids package naming and also reformatted to match the code guidelines. I am sure you have that on a keyboard shortcut :)

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Updated] (DROIDS-105) missing caching for robots.txt

Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-105:
-----------------------------------

    Fix Version/s:     (was: 0.2.0)
                   0.3.0
    
> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>             Fix For: 0.3.0
>
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935403#action_12935403 ] 

Fuad Efendi commented on DROIDS-105:
------------------------------------

Frankly, I don't understand why Droids issues a HEAD request for /robots.txt.
And what are the estimates for real-life cache sizes?
Don't forget: Java caches DNS-to-IP lookups forever by default, and java.net.URL is still synchronized...

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935490#action_12935490 ] 

Paul Rogalinski commented on DROIDS-105:
----------------------------------------

About real-life cache sizes:

I am using an LRU map here, so in theory a cache size of 2 should be sufficient to prevent the frequent hits on the robots.txt file (one robots request per potential URL request). I don't think many applications will benefit from cache sizes beyond 100 when crawling the web; the TaskQueue will usually take care of filtering already visited URLs.

So, while the caching implementation offers an easy and generic solution to the described problem, strictly speaking it should only be necessary to cache the robots.txt response per unique host.
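
For illustration, such a bounded LRU map can be built on java.util.LinkedHashMap alone, without Commons Collections; the class name and the bound of 100 are only placeholders:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {

  private final int maxEntries;

  public LruCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // drop the least recently used entry once the bound is exceeded
    return size() > maxEntries;
  }
}

It would then be used as something like: Map<java.net.URI, byte[]> robotsCache = new LruCache<java.net.URI, byte[]>(100);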

About the DNS-to-IP JVM caching: I see your point here... somewhat. If long crawls become a problem due to excessive caching, the implementing side should make use of:

java.security.Security.setProperty("networkaddress.cache.ttl", TTL);
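
For example (3600 seconds is only an illustrative value; pick a TTL that matches the crawl duration):

java.security.Security.setProperty("networkaddress.cache.ttl", "3600");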

A TTL mechanism might also become necessary for the URL-to-content cache with very large LRU cache sizes (where the robots.txt of a particular domain never gets evicted from the cache). Setting the cache size to a value around 100, heck, even 100,000 if you are about to crawl 10,000,000 sites or more (I am), should not become an issue though. But I still agree that the implementing side should be very aware of those implications.

I was also thinking about implementing the cache at a lower level, such as in the HttpClient itself, which would be a bit more challenging and complicated to implement. The proposed solution above was, on the other hand, good enough for my requirements.

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935305#action_12935305 ] 

Paul Rogalinski commented on DROIDS-105:
----------------------------------------

Attaching a new patch set which adds caching functionality to the HttpProtocol and the HttpClientContentLoader. There are still the Advanced* subclasses, which might need similar treatment; from my point of view, I would like to get rid of them by merging them into the current base classes. If we do not pay attention to this, we'll end up like the Win32 API, with plenty of doSomethingEx and doSomethingEx2 methods :/

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974843#action_12974843 ] 

Otis Gospodnetic commented on DROIDS-105:
-----------------------------------------

Paul, want to do what Javier suggested?
If possible, could you please send everything as patches?

If you are adding new classes, please slap the ASL header on top.
Note that you can/should use the same file name for new patches, so people reviewing are not confused about what needs to be reviewed/committed.


> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (DROIDS-105) missing caching for robots.txt

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogalinski updated DROIDS-105:
-----------------------------------

    Attachment: Caching-Support-and-Robots_txt-fix.patch

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (DROIDS-105) missing caching for robots.txt

Posted by "Paul Rogalinski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogalinski updated DROIDS-105:
-----------------------------------

    Attachment: CachingContentLoader.java

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Florent ANDRE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935280#action_12935280 ] 

Florent ANDRE commented on DROIDS-105:
--------------------------------------

Thanks for pointing this out.

Did you see a way to change the final status of contentLoader in HttpProtocol?

Can your CachingContentLoader class be used inside Droids? If yes, can you provide a patch from the project's root? It's more expressive and understandable than an attached class.

++

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (DROIDS-105) missing caching for robots.txt

Posted by "Javier Puerto (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935487#action_12935487 ] 

Javier Puerto commented on DROIDS-105:
--------------------------------------

Sorry Paul, I'm getting an unresolved dependency after patching. It seems you forgot to add the commons-collections dependency to pom.xml.

So I reviewed your patch, and I think that for the cache implementation it would be better to extend the default client. What do you think?

You can always call super to fetch the content when it is not cached yet, and it also allows us to implement other caching strategies, for example (based on your patch):

public class MemCacheLoader extends HttpClientContentLoader {

  // in-memory cache of response bodies keyed by URI (an LRU map, as in the patch, would keep it bounded)
  private final Map<URI, byte[]> contentCache = new HashMap<URI, byte[]>();

  @Override
  public InputStream load(URI uri) throws IOException {
    if (!contentCache.containsKey(uri)) {
      // not cached yet: let the default loader fetch it, then keep the bytes
      InputStream toBeCached = super.load(uri);
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      for (int n; (n = toBeCached.read(chunk)) != -1; ) {
        buffer.write(chunk, 0, n);
      }
      toBeCached.close();
      contentCache.put(uri, buffer.toByteArray());
    }
    return new ByteArrayInputStream(contentCache.get(uri));
  }
}

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (DROIDS-105) missing caching for robots.txt

Posted by "Bertil Chapuis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bertil Chapuis updated DROIDS-105:
----------------------------------

    Fix Version/s: 0.0.2

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>             Fix For: 0.0.2
>
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
>
> the current implementation of the HttpClient will not cache any requests to the robots.txt file. While using the CrawlingWorker this will result in 2 requests to the robots.txt (HEAD + GET) per crawled URL. So when crawling 3 URLs the target server would get 6 requests for the robots.txt.
> unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching Protocol like that one you'll find in the attachment.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira