Posted to dev@nutch.apache.org by "Greg Kim (JIRA)" <ji...@apache.org> on 2006/08/08 01:56:15 UTC

[jira] Created: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-------------------------------------------------------------------------

                 Key: NUTCH-344
                 URL: http://issues.apache.org/jira/browse/NUTCH-344
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8.1, 0.9.0
         Environment: All
            Reporter: Greg Kim
         Attachments: cleanExpiredServerBlocks.patch

The recent change to the following code in HttpBase.java tends to block fetcher threads while one thread busy-waits:

  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   <===== LINE 3:   
        String host = (String) BLOCKED_ADDR_QUEUE.getLast();
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {   
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.removeLast();
        }
      }
    }
  }

LINE3:  As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the thread that first enters this block busy-waits until it becomes empty while all other threads block on the synchronized block.  This leads to extremely poor fetcher performance.  

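Since the check-in to respect crawlDelay in robots.txt, we are no longer guaranteed that BLOCKED_ADDR_QUEUE is a FIFO list: a host with a long crawl delay can sit at the end of the queue while hosts behind it have already expired. The simple fix is to iterate over the queue once rather than busy-waiting.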
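
For illustration, here is a minimal sketch of the single-pass idea described above. It is not the attached cleanExpiredServerBlocks.patch. The two collection names are taken from the snippet; the generic types, the block() helper, the explicit "now" parameter, and the handling of a missing map entry are assumptions made so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

/** Illustrative stand-in for the bookkeeping in HttpBase; not the committed patch. */
final class ServerBlockSketch {
  // Names mirror the snippet above; the generic types are an assumption.
  static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
  static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

  /** Hypothetical helper: block a host until the given time in milliseconds. */
  static void block(String host, long unblockAt) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      BLOCKED_ADDR_TO_TIME.put(host, unblockAt);
      BLOCKED_ADDR_QUEUE.addFirst(host);
    }
  }

  /** Visits each queued host exactly once, drops the expired ones, then returns. */
  static void cleanExpiredServerBlocks(long now) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
      while (it.hasNext()) {
        String host = it.next();
        Long time = BLOCKED_ADDR_TO_TIME.get(host);
        // Assumption: a host with no map entry is treated as expired.
        if (time == null || time <= now) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          it.remove(); // removes exactly the host just inspected
        }
      }
    }
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    block("a.example", now - 1000);   // already expired
    block("b.example", now + 60000);  // crawl delay still running
    block("c.example", now - 1);      // already expired
    cleanExpiredServerBlocks(now);
    System.out.println(BLOCKED_ADDR_QUEUE); // only b.example is left
  }
}
```

Because the loop ends after one pass, the lock is held only for the length of the queue, and hosts that have not expired yet are simply left for a later call.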

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Resolved: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren resolved NUTCH-344.
------------------------------

    Fix Version/s: 0.8.1
                   0.9.0
       Resolution: Fixed

I just committed this to 0.8 branch and trunk, thanks Greg!


[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Jacob Brunson (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427096 ] 
            
Jacob Brunson commented on NUTCH-344:
-------------------------------------

I'm having problems with the patch committed in revision #429779.  I previously had the "fetch aborted with X hung threads" problem.  After updating to this revision, fetching goes fine for a while, but then I get this error on just about every page fetch attempt:
2006-08-09 23:27:28,548 INFO  fetcher.Fetcher - fetching http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO  fetcher.Fetcher - fetch of http://www.xmission.com/~nelsonb/resources.htm failed with: java.lang.NullPointerException



[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Jason Calabrese (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Jason Calabrese updated NUTCH-344:
----------------------------------

    Attachment: HttpBase.patch

This fix missed one small change, which caused BLOCKED_ADDR_TO_TIME and BLOCKED_ADDR_QUEUE to get out of sync.

To fix the problem you only need to change the remove on line 385 to:
BLOCKED_ADDR_QUEUE.remove(i);

I can report that the fetch is now much faster with both of these fixes.
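
To illustrate how the two structures can drift apart, here is a hedged sketch. The committed revision is not quoted in this thread, so the loop shape, its direction, the class name, and the removeByIndex flag are all assumptions. Only the remove(i) call and the two collections come from the messages above. With removeByIndex set to false, the loop removes the map entry for the host at index i but drops whatever host is last in the queue, which is one way a later lookup could return null and fail as in the reported stack trace.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

/** Hypothetical model of the map/queue pair; not the actual HttpBase code. */
final class QueueSyncSketch {
  final Map<String, Long> blockedAddrToTime = new HashMap<>();
  final LinkedList<String> blockedAddrQueue = new LinkedList<>();

  void block(String host, long unblockAt) {
    blockedAddrToTime.put(host, unblockAt);
    blockedAddrQueue.addFirst(host);
  }

  /** One pass over the queue by index (assumed shape of the patched loop). */
  void clean(long now, boolean removeByIndex) {
    for (int i = blockedAddrQueue.size() - 1; i >= 0; i--) {
      String host = blockedAddrQueue.get(i);
      long time = blockedAddrToTime.get(host); // unboxing null throws NullPointerException
      if (time <= now) {
        blockedAddrToTime.remove(host);
        if (removeByIndex) {
          blockedAddrQueue.remove(i);     // drops the host that was just inspected
        } else {
          blockedAddrQueue.removeLast();  // may drop a different host
        }
      }
    }
  }

  public static void main(String[] args) {
    QueueSyncSketch buggy = new QueueSyncSketch();
    buggy.block("a.example", 5000L); // still blocked at now = 1000
    buggy.block("b.example", 10L);   // expired
    buggy.block("c.example", 20L);   // expired
    buggy.clean(1000L, false);
    System.out.println("queue=" + buggy.blockedAddrQueue
        + " map=" + buggy.blockedAddrToTime.keySet());
    try {
      buggy.clean(1000L, false);
    } catch (NullPointerException e) {
      System.out.println("second pass failed with NullPointerException");
    }

    QueueSyncSketch fixed = new QueueSyncSketch();
    fixed.block("a.example", 5000L);
    fixed.block("b.example", 10L);
    fixed.block("c.example", 20L);
    fixed.clean(1000L, true);
    System.out.println("queue=" + fixed.blockedAddrQueue
        + " map=" + fixed.blockedAddrToTime.keySet());
  }
}
```

Removing by index keeps the invariant that every host in the queue also has an entry in the map, so the lookup at the top of the loop cannot come back null.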


[jira] Closed: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren closed NUTCH-344.
----------------------------



[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Jason Calabrese (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427238 ] 
            
Jason Calabrese commented on NUTCH-344:
---------------------------------------

This issue is still marked as resolved; it needs to be re-opened so that the patch will be committed to SVN.




[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Greg Kim updated NUTCH-344:
---------------------------

    Affects Version/s: 0.8


[jira] Commented: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427100 ] 
            
Greg Kim commented on NUTCH-344:
--------------------------------

I had the correct version in my workspace but botched the copy over to the vendor trunk. Doh!  Thanks, Jason, for catching it!

Jacob, your problem should be resolved with the one-line patch that Jason provided.
