Posted to dev@nutch.apache.org by "Greg Kim (JIRA)" <ji...@apache.org> on 2006/08/08 01:56:15 UTC
[jira] Created: (NUTCH-344) Fetcher threads blocked on synchronized
block in cleanExpiredServerBlocks
Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-------------------------------------------------------------------------
Key: NUTCH-344
URL: http://issues.apache.org/jira/browse/NUTCH-344
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1, 0.9.0
Environment: All
Reporter: Greg Kim
Attachments: cleanExpiredServerBlocks.patch
A recent change to the following code in HttpBase.java tends to block fetcher threads while one thread busy-waits:
  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <===== LINE 3
        String host = (String) BLOCKED_ADDR_QUEUE.getLast();
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.removeLast();
        }
      }
    }
  }
LINE 3: As long as there are *any* entries in BLOCKED_ADDR_QUEUE, the thread that first enters this block busy-waits until the queue becomes empty, while all other threads block on the synchronized block. This leads to extremely poor fetcher performance.
Since the check-in to respect crawlDelay in robots.txt, we are no longer guaranteed that BLOCKED_ADDR_QUEUE is a FIFO list, so the entry returned by getLast() is not necessarily the next one to expire. The simple fix is to iterate over the queue once rather than busy-waiting.
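For illustration only, a single-pass cleanup along the lines described above might look like the sketch below. This is not the attached cleanExpiredServerBlocks.patch. The class name, the block() helper, the `now` parameter, and the null check are assumptions made so the example is self-contained; only the two field names and the method name come from the report.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

public class SinglePassCleanup {
    // Hypothetical stand-ins for the HttpBase fields named in the report.
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    /** Hypothetical helper: records that a host is blocked until the given time. */
    static void block(String host, long expiryMillis) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            BLOCKED_ADDR_TO_TIME.put(host, expiryMillis);
            BLOCKED_ADDR_QUEUE.addFirst(host);
        }
    }

    /**
     * Visits every queued host exactly once and drops the expired ones.
     * The current time is passed in (an assumption) so the method is testable.
     */
    static void cleanExpiredServerBlocks(long now) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long expiry = BLOCKED_ADDR_TO_TIME.get(host);
                // A missing map entry means the two structures drifted apart;
                // drop the queue entry instead of unboxing null.
                if (expiry == null || expiry <= now) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove(); // removes exactly the host that was just checked
                }
            }
        }
    }

    public static void main(String[] args) {
        block("slow.example", 5000L); // oldest entry, but a long crawl delay
        block("fast.example", 1000L); // newer entry that expires first
        cleanExpiredServerBlocks(2000L);
        System.out.println(BLOCKED_ADDR_QUEUE);
    }
}
```

The loop always terminates after one pass, so the lock is held only for the time it takes to scan the queue, regardless of whether any entry has expired yet.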
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-344) Fetcher threads blocked on
synchronized block in cleanExpiredServerBlocks
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Sami Siren resolved NUTCH-344.
------------------------------
Fix Version/s: 0.8.1, 0.9.0
Resolution: Fixed
I just committed this to 0.8 branch and trunk, thanks Greg!
[jira] Commented: (NUTCH-344) Fetcher threads blocked on
synchronized block in cleanExpiredServerBlocks
Posted by "Jacob Brunson (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427096 ]
Jacob Brunson commented on NUTCH-344:
-------------------------------------
I'm having problems with the patch committed in revision #429779. Previously I was seeing the "fetch aborted with X hung threads" problem. After updating to this revision, fetching goes fine for a while, but then I get this error on just about every page fetch attempt:
2006-08-09 23:27:28,548 INFO fetcher.Fetcher - fetching http://www.xmission.com/~nelsonb/resources.htm
2006-08-09 23:27:28,549 ERROR http.Http - java.lang.NullPointerException
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.cleanExpiredServerBlocks(HttpBase.java:382)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:323)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:188)
2006-08-09 23:27:28,549 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144)
2006-08-09 23:27:28,549 INFO fetcher.Fetcher - fetch of http://www.xmission.com/~nelsonb/resources.htm failed with: java.lang.NullPointerException
[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized
block in cleanExpiredServerBlocks
Posted by "Jason Calabrese (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Jason Calabrese updated NUTCH-344:
----------------------------------
Attachment: HttpBase.patch
The committed fix missed one small change, which caused BLOCKED_ADDR_TO_TIME and BLOCKED_ADDR_QUEUE to get out of sync.
To fix the problem you only need to change the remove call on line 385 to:
BLOCKED_ADDR_QUEUE.remove(i);
I can report that the fetch is now much faster with both of these fixes.
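To show why the removal call matters, here is a small hypothetical reconstruction. It is not the code from either patch: the class, the method names, the loop direction, and the boolean flag are all assumptions for illustration. It runs one indexed pass over a queue and removes the expired host either by index or with removeLast(), then checks whether the queue and the map still track the same hosts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;

public class RemoveByIndexDemo {

    /** One pass over the queue; the flag selects which removal call is used. */
    static void clean(Map<String, Long> times, LinkedList<String> queue,
                      long now, boolean removeByIndex) {
        for (int i = queue.size() - 1; i >= 0; i--) {
            String host = queue.get(i);
            Long expiry = times.get(host);
            if (expiry != null && expiry <= now) {
                times.remove(host);
                if (removeByIndex) {
                    queue.remove(i);     // removes the host that was just checked
                } else {
                    queue.removeLast();  // may remove a different, still-blocked host
                }
            }
        }
    }

    /** True when the queue and the map track exactly the same hosts. */
    static boolean inSync(Map<String, Long> times, LinkedList<String> queue) {
        return times.keySet().equals(new HashSet<>(queue));
    }

    /** Builds a two-host scenario and reports whether it stays consistent. */
    static boolean runScenario(boolean removeByIndex) {
        Map<String, Long> times = new HashMap<>();
        LinkedList<String> queue = new LinkedList<>();
        times.put("expired.example", 1000L);
        queue.add("expired.example");
        times.put("blocked.example", 5000L);
        queue.add("blocked.example");
        clean(times, queue, 2000L, removeByIndex);
        return inSync(times, queue);
    }

    public static void main(String[] args) {
        System.out.println("remove(i):    in sync = " + runScenario(true));
        System.out.println("removeLast(): in sync = " + runScenario(false));
    }
}
```

In the removeLast() case the expired host stays in the queue with no map entry. In code that unboxes the result of BLOCKED_ADDR_TO_TIME.get(host) without a null check, the next lookup for that host would throw a NullPointerException, which is consistent with the stack trace reported above.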
[jira] Closed: (NUTCH-344) Fetcher threads blocked on synchronized
block in cleanExpiredServerBlocks
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Sami Siren closed NUTCH-344.
----------------------------
[jira] Commented: (NUTCH-344) Fetcher threads blocked on
synchronized block in cleanExpiredServerBlocks
Posted by "Jason Calabrese (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427238 ]
Jason Calabrese commented on NUTCH-344:
---------------------------------------
This issue is still marked as resolved; it needs to be re-opened so that the patch can be committed to SVN.
[jira] Updated: (NUTCH-344) Fetcher threads blocked on synchronized
block in cleanExpiredServerBlocks
Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Greg Kim updated NUTCH-344:
---------------------------
Affects Version/s: 0.8
[jira] Commented: (NUTCH-344) Fetcher threads blocked on
synchronized block in cleanExpiredServerBlocks
Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-344?page=comments#action_12427100 ]
Greg Kim commented on NUTCH-344:
--------------------------------
Had the correct version in my workspace; botched the copy over to the vendor trunk. Doh! Thanks, Jason, for catching it!
Jacob, your problem should be resolved with the one-line patch that Jason provided.