Posted to dev@chukwa.apache.org by "Sourygna Luangsay (JIRA)" <ji...@apache.org> on 2011/06/28 23:47:28 UTC
[jira] [Created] (CHUKWA-593) Archive daemon: infinite loop at midnight
Archive daemon: infinite loop at midnight
-----------------------------------------
Key: CHUKWA-593
URL: https://issues.apache.org/jira/browse/CHUKWA-593
Project: Chukwa
Issue Type: Bug
Components: MR Data Processors
Affects Versions: 0.4.0
Environment: Debian 5.0, Hadoop 0.20
Reporter: Sourygna Luangsay
Priority: Minor
Fix For: 0.5.0, 0.4.0
The Chukwa archive manager daemon enters an infinite loop between midnight and 1:00 AM. This increases the load on the namenode and causes both the Chukwa and namenode logs to grow enormously.
The problem seems to come from the start function of ChukwaArchiveManager.java (in package org/apache/hadoop/chukwa/extraction/archive). At midnight, there are two directories in /chukwa/dataSinkArchives/ (one for the previous day and one for the new day). This means that we enter neither the "daysInRawArchiveDir.length == 0" condition nor the "daysInRawArchiveDir.length == 1" one. The processDay function is then called, but it does little because of the "modificationDate < oneHourAgo" condition.
As a result, we loop without having slept or deleted the previous day's directory. This repeats for one hour.
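The control flow described above can be sketched as a small model. This is not the actual ChukwaArchiveManager source: the class name, the Action enum and both method names are hypothetical, and it assumes that the "length == 0" branch sleeps, as the report implies. It only illustrates why two day directories fall through to processDay without any sleep, and why processDay then finds nothing old enough to archive.

```java
// Simplified model of the scheduling decision described in the report.
// NOT the real ChukwaArchiveManager code; all names here are illustrative.
class ArchiveLoopModel {
    static final long ONE_HOUR = 60L * 60 * 1000;

    enum Action { SLEEP_NO_DATA, SLEEP_TOO_SOON, PROCESS_DAYS }

    /** Decide what one loop iteration does, given the number of day directories. */
    static Action decide(int dayDirCount, long now, long lastRun) {
        if (dayDirCount == 0) {
            return Action.SLEEP_NO_DATA;        // assumed: nothing to archive, so sleep
        }
        if (dayDirCount == 1) {
            long nextRun = lastRun + (2 * ONE_HOUR) - (60 * 1000); // 2h - 1min
            if (now < nextRun) {
                return Action.SLEEP_TOO_SOON;   // throttled: last run was too recent
            }
        }
        // Two or more directories always end up here, with no sleep on this path.
        return Action.PROCESS_DAYS;
    }

    /** Model of the processDay filter: only files older than one hour are archived. */
    static boolean isArchivable(long modificationDate, long now) {
        long oneHourAgo = now - ONE_HOUR;
        return modificationDate < oneHourAgo;
    }
}
```

With two directories and recently written files, every iteration returns PROCESS_DAYS while isArchivable is false, so the loop spins until the files age past one hour.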
Here is how I propose to change the "daysInRawArchiveDir.length == 1" condition block in the start function:
    if (daysInRawArchiveDir.length >= 1) {
      long nextRun = lastRun + (2*ONE_HOUR) - (1*60*1000); // 2h - 1min
      if (now < nextRun) {
        log.info("lastRun < 2 hours so skip archive for now, going to sleep for 30 minutes, currentDate is:" + new java.util.Date());
        Thread.sleep(30 * 60 * 1000);
        continue;
      }
    }
For me, this removed the infinite loop. But maybe there is a reason to separate the "1 directory" case from the "many directories" case. I have read through the documentation and the Subversion history but could not find one.
If there is one, could someone explain it to me?
Regards.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (CHUKWA-593) Archive daemon: infinite loop at midnight
Posted by "Eric Yang (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CHUKWA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Yang resolved CHUKWA-593.
------------------------------
Resolution: Fixed
Fix Version/s: (was: 0.4.0)
Assignee: Eric Yang
Release Note: Fixed infinite loop archiving at midnight. (Sourygna Luangsay via Eric Yang)
Thanks Sourygna. I just committed this.
> Archive daemon: infinite loop at midnight
> -----------------------------------------
>
> Key: CHUKWA-593
> URL: https://issues.apache.org/jira/browse/CHUKWA-593
> Project: Chukwa
> Issue Type: Bug
> Components: MR Data Processors
> Affects Versions: 0.4.0
> Environment: Debian 5.0, Hadoop 0.20
> Reporter: Sourygna Luangsay
> Assignee: Eric Yang
> Priority: Minor
> Fix For: 0.5.0
>
> Original Estimate: 10m
> Remaining Estimate: 10m
[jira] [Commented] (CHUKWA-593) Archive daemon: infinite loop at midnight
Posted by "Sourygna Luangsay (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CHUKWA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057487#comment-13057487 ]
Sourygna Luangsay commented on CHUKWA-593:
------------------------------------------
My collectors, archive manager and all Hadoop components of my cluster are NTP-synced and use the same timezone.
I understand the reason for daysInRawArchiveDir.length==1 instead of >=1. So maybe the fix should be in the processDay function. For instance, in the for loop, we could change the condition "if (modificationDate < oneHourAgo)" to something like "if (modificationDate < oneHourAgo || workingDay < currentDay)". I haven't tried this solution, so I'm not sure it works; I think I will keep the change from my first post, since it works and removes the infinite loop. Such a new condition would avoid the latency, no?
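A minimal sketch of the proposed condition, to show how it behaves at the day boundary. The class and method names are hypothetical, and the encoding of workingDay and currentDay as yyyyMMdd integers is an assumption made for illustration; the thread does not say how processDay actually represents them.

```java
// Sketch of the proposed file-selection condition for processDay.
// Hypothetical names; the yyyyMMdd integer encoding of days is an assumption.
class ProcessDayCondition {
    static final long ONE_HOUR = 60L * 60 * 1000;

    /**
     * @param modificationDate last modification time of the file, in milliseconds
     * @param now              current time, in milliseconds
     * @param workingDay       day being processed, e.g. 20110628 (assumed encoding)
     * @param currentDay       today's date in the same encoding
     * @return true if the file should be archived in this pass
     */
    static boolean shouldArchive(long modificationDate, long now,
                                 int workingDay, int currentDay) {
        long oneHourAgo = now - ONE_HOUR;
        // Old files are always archived; files from a previous day are
        // archived immediately, even if they were written minutes ago.
        return modificationDate < oneHourAgo || workingDay < currentDay;
    }
}
```

Under this condition a file written ten minutes before midnight is picked up on the first pass after midnight, so the previous day's directory can be emptied without waiting an hour.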
[jira] [Commented] (CHUKWA-593) Archive daemon: infinite loop at midnight
Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CHUKWA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056834#comment-13056834 ]
Eric Yang commented on CHUKWA-593:
----------------------------------
The processDay function should delete the previous day's directory if that directory is empty. During the hour between two days, the system is designed to archive the previous day as soon as possible.
Do you have collectors running in multiple timezones, or are the server clocks out of sync by one hour?
The busy loop should not happen unless something continues to write to the previous day's directory. daysInRawArchiveDir.length==1 is there to ensure that the roll up for the previous day happens as soon as possible.
If we change it to >=1, then the roll up for the previous day will not occur until 1:59 AM of the current day. We should avoid this latency if possible.
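The 1:59 AM figure follows from the throttle formula in the proposed patch (lastRun + 2h - 1min) if one assumes the last archive run happened exactly at midnight. That starting point is an assumption for illustration, and the class and method names below are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

// Worked arithmetic for the roll-up latency under the ">= 1" throttle.
// Hypothetical helper; assumes the last run time is known as a wall-clock time.
class RollupLatency {
    /** Earliest time the throttle allows another run: lastRun + 2h - 1min. */
    static LocalTime nextRun(LocalTime lastRun) {
        return lastRun.plus(Duration.ofHours(2)).minus(Duration.ofMinutes(1));
    }
}
```

For a last run at 00:00, nextRun returns 01:59, which matches the delay mentioned above.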
[jira] [Commented] (CHUKWA-593) Archive daemon: infinite loop at midnight
Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/CHUKWA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057497#comment-13057497 ]
Eric Yang commented on CHUKWA-593:
----------------------------------
{noformat}
if (modificationDate < oneHourAgo || workingDay < currentDay)
{noformat}
+1, looks like the right fix.