Posted to common-dev@hadoop.apache.org by "Lohit Vijayarenu (JIRA)" <ji...@apache.org> on 2008/08/29 08:48:44 UTC

[jira] Created: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Increment checkpoint if we see failures in rollEdits
----------------------------------------------------

                 Key: HADOOP-4045
                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.19.0
            Reporter: Lohit Vijayarenu
             Fix For: 0.19.0


In _FSEditLog::rollEdits_, if we encounter an error while opening edits.new, we remove the storage directory associated with it. At this point we should also increment the checkpoint time on all other directories.
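A minimal sketch of the requested behavior, with hypothetical names (the real FSEditLog/FSImage code is more involved): when opening edits.new fails for one storage directory, that directory is dropped from service and the checkpoint time is advanced on the surviving directories, so a restart never prefers the stale copy.

{code}
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of the rollEdits() error path described above.
class EditLogRollSketch {
  static class StorageDirectory {
    final File root;
    long checkpointTime;
    StorageDirectory(File root, long checkpointTime) {
      this.root = root;
      this.checkpointTime = checkpointTime;
    }
    void openNewEdits() throws IOException {
      // In the real code this creates and opens "current/edits.new".
      File editsNew = new File(root, "current/edits.new");
      if (!editsNew.getParentFile().isDirectory()) {
        throw new IOException("cannot open " + editsNew);
      }
    }
  }

  private final List<StorageDirectory> dirs = new ArrayList<StorageDirectory>();

  void rollEdits() {
    boolean anyFailed = false;
    for (Iterator<StorageDirectory> it = dirs.iterator(); it.hasNext();) {
      StorageDirectory sd = it.next();
      try {
        sd.openNewEdits();
      } catch (IOException e) {
        it.remove();          // drop the failed storage directory from service
        anyFailed = true;
      }
    }
    if (anyFailed) {
      // The fix requested by this issue: advance the checkpoint time on every
      // surviving directory so the removed one can never be mistaken for the
      // most recent image/edits on restart.
      for (StorageDirectory sd : dirs) {
        sd.checkpointTime++;
      }
    }
  }
}
{code}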

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690162#action_12690162 ] 

Konstantin Shvachko commented on HADOOP-4045:
---------------------------------------------

# In {{FSImage.setCheckpointTime()}} the variable {{al}} is not used.
# {{processIOError(ArrayList<StorageDirectory> sds)}} may be eliminated.
# I would also get rid of {{processIOError(ArrayList<EditLogOutputStream> errorStreams)}}.
The point is that it is better to have only one processIOError in each class; otherwise it can get
as messy as it is now with all the different variants of it.
If you think that is too many changes, then let's at least make both of them private.
# Do we want to make {{removedStorageDirs}} a map in order to avoid adding the same directory
twice, or does that never happen? (See the sketch after this list.)
# Same with {{Storage.storageDirs}}. If we search in a collection, then we might want to use
a searchable collection. This may be done in a separate issue.
# It's somewhat confusing: {{FSImage.processIOError()}} calls {{editLog.processIOError()}} and
then {{FSEditLog.processIOError()}} calls {{fsimage.processIOError()}}. Is it going to converge
at some point?
# {{setCheckpointTime()}} ignores IO errors. Just mentioning this; I don't see how to avoid it.
Failed streams/directories will be removed the next time flushAndSync() is called.
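On item 4, a hedged sketch of the map-based bookkeeping being suggested (field and type names here are illustrative, not the actual patch): keying removed directories by their root path makes re-adding the same directory a no-op.

{code}
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: track removed storage directories by root path so the
// same directory cannot be recorded twice.
class RemovedDirsSketch {
  static class StorageDirectory {
    final File root;
    StorageDirectory(File root) { this.root = root; }
  }

  private final Map<File, StorageDirectory> removedStorageDirs =
      new LinkedHashMap<File, StorageDirectory>();

  void recordRemoved(StorageDirectory sd) {
    // put() replaces an existing entry, so re-reporting the same failed
    // directory leaves exactly one record behind.
    removedStorageDirs.put(sd.root, sd);
  }
}
{code}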

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-4045:
----------------------------------------

    Affects Version/s: 0.19.0
         Hadoop Flags: [Reviewed]

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670883#action_12670883 ] 

shv edited comment on HADOOP-4045 at 2/5/09 12:32 PM:
----------------------------------------------------------------------

Some more requirements for {{processIOError()}}:
# {{FSEdits.processIOError()}} should always unlock the storage being removed from service. This is necessary because, if only the edits file causes the problem, the directory will still be locked when we try to reuse it later, as proposed in HADOOP-4885.
# It should always call {{incrementCheckpointTime()}}. Otherwise the abandoned directories may mistakenly be used for loading the latest image/edits.
# {{incrementCheckpointTime()}} should not recursively call {{processIOError()}}, but we should still be able to handle failures of multiple edits directories.
# It should close the {{EditsOutputStream}}.
# All the logic for adding lost edits dirs to {{removedStorageDirs}} should be in {{processIOError()}}.
# {{FSEdits.processIOError(int)}} should be eliminated completely.
# It seems to me that the most appropriate prototype for processIOError would be
{code}
FSEdits.processIOError(Collection<EditLogOutputStream> errorStreams)
{code}
Every method that works with streams should accumulate failed streams and then call {{processIOError(Collection)}} for the accumulated collection, as in {{FSEdits.logSync()}}.
This might even let us have only one such method rather than two.

 [ -edited the last item- ]
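Putting these requirements together, a hedged sketch of what the single, stream-oriented processIOError could look like (class and helper names are assumptions, not the committed code): close the stream, unlock and unregister its directory, record it in {{removedStorageDirs}}, and bump the checkpoint time once at the end.

{code}
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of the single processIOError(Collection<...>) proposed
// above; the real FSEditLog/FSImage code is more involved.
class ProcessIOErrorSketch {
  interface EditLogOutputStream extends Closeable {
    StorageDirectory getStorageDirectory(); // null for non-directory streams
  }
  static class StorageDirectory {
    final File root;
    StorageDirectory(File root) { this.root = root; }
    void unlock() { /* release the in_use.lock held on this directory */ }
  }

  private final List<EditLogOutputStream> editStreams = new ArrayList<EditLogOutputStream>();
  private final List<StorageDirectory> storageDirs = new ArrayList<StorageDirectory>();
  private final List<StorageDirectory> removedStorageDirs = new ArrayList<StorageDirectory>();

  void processIOError(Collection<EditLogOutputStream> errorStreams) {
    if (errorStreams == null || errorStreams.isEmpty()) {
      return;                                   // nothing failed
    }
    for (EditLogOutputStream stream : errorStreams) {
      try {
        stream.close();                         // requirement 4
      } catch (IOException ignored) {
        // the stream is already failing; nothing more can be done for it
      }
      editStreams.remove(stream);
      StorageDirectory sd = stream.getStorageDirectory();
      if (sd != null) {
        sd.unlock();                            // requirement 1
        storageDirs.remove(sd);
        removedStorageDirs.add(sd);             // requirement 5
      }
    }
    // Requirement 2: advance the checkpoint time on the directories that are
    // still in service; requirement 3: do not recurse back into processIOError.
    incrementCheckpointTime();
  }

  private void incrementCheckpointTime() {
    // Placeholder: the real method rewrites fstime in every remaining
    // storage directory.
  }
}
{code}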

      was (Author: shv):
    Some more requirements for {{processIOError()}}
# {{FSEdits.processIOError()}} should always unlock the storage being removed from service. This is necessary because if only the edits file causes the problem the directory will still be locked when we try to reuse it later as proposed in HADOOP-4885.
# It should always call {{incrementCheckpointTime()}}. Otherwise the abandoned directories may mistakenly be used for loading the latest image/edits.
#  {{incrementCheckpointTime()}} should not recursively call {{processIOError()}}. But we should be able to handle failures of multiple edits directories.
# It should close the {{EditsOutputStream}}.
# All the logic with adding lost edits dirs to {{removedStorageDirs}} should be in {{processIOError()}}.
# {{FSEdits.processIOError(int)}} should be eliminated completely.
# It seems to me that the most appropriate prototype for processIOError would be
{code}
FSEdits.processIOError(ArrayList<StorageDirectory> errorDirs)
{code}
This might even let us have only 1 such method rather than two.
  
> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.1
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-4045:
----------------------------------------

    Fix Version/s:     (was: 0.19.2)
                   0.21.0

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695426#action_12695426 ] 

Hudson commented on HADOOP-4045:
--------------------------------

Integrated in Hadoop-trunk #796 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/796/])
    . Fix processing of IO errors in EditsLog. Contributed by Boris Shkolnik.


> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Attachment: HADOOP-4045-1.patch

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-4045:
----------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.
Thank you Boris.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689773#action_12689773 ] 

Hadoop QA commented on HADOOP-4045:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12403739/HADOOP-4045-1.patch
  against trunk revision 758593.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/144/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/144/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/144/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/144/console

This message is automatically generated.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670883#action_12670883 ] 

Konstantin Shvachko commented on HADOOP-4045:
---------------------------------------------

Some more requirements for {{processIOError()}}:
# {{FSEdits.processIOError()}} should always unlock the storage being removed from service. This is necessary because, if only the edits file causes the problem, the directory will still be locked when we try to reuse it later, as proposed in HADOOP-4885.
# It should always call {{incrementCheckpointTime()}}. Otherwise the abandoned directories may mistakenly be used for loading the latest image/edits.
# {{incrementCheckpointTime()}} should not recursively call {{processIOError()}}, but we should still be able to handle failures of multiple edits directories.
# It should close the {{EditsOutputStream}}.
# All the logic for adding lost edits dirs to {{removedStorageDirs}} should be in {{processIOError()}}.
# {{FSEdits.processIOError(int)}} should be eliminated completely.
# It seems to me that the most appropriate prototype for processIOError would be
{code}
FSEdits.processIOError(ArrayList<StorageDirectory> errorDirs)
{code}
This might even let us have only one such method rather than two.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.1
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670211#action_12670211 ] 

Konstantin Shvachko commented on HADOOP-4045:
---------------------------------------------

I also see that {{incrementCheckpointTime()}} is not called on other occasions, such as
- {{EditLog.logEdit()}}
- {{EditLog.close()}}
- {{FSImage.rollFSImage()}}

If we don't increment fsTime for the remaining directories, which is done by {{incrementCheckpointTime()}}, we risk losing data on name-node restart. I think this is critical enough to be a blocker.
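A hedged illustration of the fstime mechanism being described (names simplified; the actual FSImage code differs): the namenode loads the image/edits from the directory with the newest fstime at startup, so bumping and persisting it in every remaining directory keeps an abandoned directory from winning.

{code}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

// Simplified sketch: bump and persist the checkpoint time in every storage
// directory that is still healthy, so an abandoned directory always looks older.
class CheckpointTimeSketch {
  private long checkpointTime;

  void incrementCheckpointTime(List<File> activeStorageRoots) {
    checkpointTime++;
    for (File root : activeStorageRoots) {
      File fstime = new File(root, "current/fstime");
      try {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(fstime));
        try {
          out.writeLong(checkpointTime);
        } finally {
          out.close();
        }
      } catch (IOException e) {
        // Mirrors the concern raised in the comments: a failure here cannot be
        // fed back into processIOError without recursing.
      }
    }
  }
}
{code}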


> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-4045:
--------------------------------

    Fix Version/s:     (was: 0.19.1)
                   0.19.2

As discussed on core-dev@ (http://www.nabble.com/Hadoop-0.19.1-td21739202.html), we will disable append in 0.19.1. Moving these append-related issues to 0.19.2.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.2
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689606#action_12689606 ] 

Boris Shkolnik commented on HADOOP-4045:
----------------------------------------

A couple of words about the design.
There are StorageDirectories (SDs) of different types: IMAGE, EDITS, or both, which means some do not have any EditLogStreams associated with them. On the other hand, some EditLogStreams are attached to SDs and some are not (BACKUP-node streaming). Thus we need to be able to handle IOErrors from both sides, so processIOError can be called for an SD or for an EditLogStream (eStream). If it is called for an SD and this SD has an associated eStream, we need to call processIOError for the stream too, and vice versa. So I left one processIOError function in each class, with an optional flag to specify whether the error should be propagated to the corresponding SD or eStream.
All of these functions accept an ArrayList as an argument.
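A hedged reading of this design in code (signatures are illustrative, not the committed patch): each class keeps a single processIOError, and a propagate flag controls whether the failure is forwarded to the counterpart class, which calls back with propagate set to false so the mutual calls terminate.

{code}
import java.util.ArrayList;

// Illustrative only: one processIOError per class plus a "propagate" flag,
// covering both storage directories and edit log streams without recursing
// forever between the two classes.
class PropagateFlagSketch {
  static class StorageDirectory { EditStream stream; /* null for IMAGE-only dirs */ }
  static class EditStream { StorageDirectory dir;    /* null for backup-node streams */ }

  static class ImageSide {
    EditsSide edits;
    void processIOError(ArrayList<StorageDirectory> failedDirs, boolean propagate) {
      for (StorageDirectory sd : failedDirs) {
        // ... remove sd from service, unlock it, record it as removed ...
        if (propagate && sd.stream != null) {
          ArrayList<EditStream> streams = new ArrayList<EditStream>();
          streams.add(sd.stream);
          edits.processIOError(streams, false);   // do not bounce back
        }
      }
    }
  }

  static class EditsSide {
    ImageSide image;
    void processIOError(ArrayList<EditStream> failedStreams, boolean propagate) {
      for (EditStream es : failedStreams) {
        // ... close the stream and drop it from the active list ...
        if (propagate && es.dir != null) {
          ArrayList<StorageDirectory> dirs = new ArrayList<StorageDirectory>();
          dirs.add(es.dir);
          image.processIOError(dirs, false);      // do not bounce back
        }
      }
    }
  }
}
{code}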

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675157#action_12675157 ] 

Konstantin Shvachko commented on HADOOP-4045:
---------------------------------------------

This is not append-related, but it is OK to move it out of 0.19.1.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.2
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Attachment: HADOOP-4045-3.patch

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4045:
------------------------------------

    Priority: Critical  (was: Blocker)

Demoted unless there is an argument that this is a serious regression from earlier.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694979#action_12694979 ] 

Hadoop QA commented on HADOOP-4045:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404404/HADOOP-4045-3.patch
  against trunk revision 761082.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/97/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/97/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/97/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/97/console

This message is automatically generated.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Lohit Vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626868#action_12626868 ] 

Lohit Vijayarenu commented on HADOOP-4045:
------------------------------------------

It would be good to check whether we miss this in any other place as well.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>             Fix For: 0.19.0
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670215#action_12670215 ] 

Konstantin Shvachko commented on HADOOP-4045:
---------------------------------------------

This should include some cleanup. We have at least four different methods called {{processIOError()}}. It would be nice to reduce the number to just two:
- {{FSImage.processIOError()}} and
- {{FSEdits.processIOError()}}

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.1
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Status: Patch Available  (was: Open)

Implemented Konstantin's comments.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-4045:
----------------------------------------

         Priority: Blocker  (was: Major)
    Fix Version/s: 0.19.1

fstime and its relation to {{processIOError()}} was discussed in HADOOP-1188.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Priority: Blocker
>             Fix For: 0.19.1
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Affects Version/s:     (was: 0.19.0)
               Status: Patch Available  (was: Open)

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Attachment: HADOOP-4045.patch

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler reassigned HADOOP-4045:
---------------------------------------

    Assignee: Boris Shkolnik

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Blocker
>             Fix For: 0.19.2
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694814#action_12694814 ] 

Boris Shkolnik commented on HADOOP-4045:
----------------------------------------


   1. In FSImage.setCheckpointTime() the variable al is not used.
bq. Fixed.
   2. processIOError(ArrayList<StorageDirectory> sds) may be eliminated.
bq. This would force using the two-argument version of the function everywhere, in most cases with "true" as the value of the second argument.
   3. I would also get rid of processIOError(ArrayList<EditLogOutputStream> errorStreams). The point is that it is better to have only one processIOError in each class; otherwise it can get as messy as it is now with all the different variants of it. If you think that is too many changes, then let's at least make both of them private.
bq. See 2.
   4. Do we want to make removedStorageDirs a map in order to avoid adding the same directory twice, or does that never happen?
bq. Good idea. It will need a separate JIRA.
   5. Same with Storage.storageDirs. If we search in a collection, then we might want to use a searchable collection. This may be done in a separate issue.
bq. Same as 4.
   6. It's somewhat confusing: FSImage.processIOError() calls editLog.processIOError() and then FSEditLog.processIOError() calls fsimage.processIOError(). Is it going to converge at some point?
bq. It should. Every time processIOError calls its counterpart in the other class, it passes _false_ as the second (propagate) argument to make sure it will not call back into the original function.
   7. setCheckpointTime() ignores IO errors. Just mentioning this; I don't see how to avoid it. Failed streams/directories will be removed the next time flushAndSync() is called.
bq. Yes, it should be caught elsewhere.



> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Shkolnik updated HADOOP-4045:
-----------------------------------

    Status: Open  (was: Patch Available)

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045-3.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4045:
--------------------------------

    Fix Version/s:     (was: 0.19.0)

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: Lohit Vijayarenu
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4045) Increment checkpoint if we see failures in rollEdits

Posted by "Boris Shkolnik (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689724#action_12689724 ] 

Boris Shkolnik commented on HADOOP-4045:
----------------------------------------

Manual testing done:
1. Mount two directories (one for Edits and Image, one for Edits only).
2. Create some files.
3. Unmount one of them and wait for a checkpoint (or create a file); verify that the failed dir is removed.
4. Unmount the other one (optional) - more verification.
5. Mount one back and trigger a checkpoint (or create new files); verify that the checkpoint time is updated and the files have the same size and MD5.
6. Mount the other one (optional) - more verification.
7. Repeat 3 and 5.
8. Check the WebUI throughout.

> Increment checkpoint if we see failures in rollEdits
> ----------------------------------------------------
>
>                 Key: HADOOP-4045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4045
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Lohit Vijayarenu
>            Assignee: Boris Shkolnik
>            Priority: Critical
>             Fix For: 0.19.2
>
>         Attachments: HADOOP-4045-1.patch, HADOOP-4045.patch
>
>
> In _FSEditLog::rollEdits_, if we encounter an error during opening edits.new, we remove  the store directory associated with it. At this point we should also increment checkpoint on all other directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.