Posted to issues@storm.apache.org by "Yoel Cabo Lopez (JIRA)" <ji...@apache.org> on 2016/11/24 00:09:58 UTC

[jira] [Created] (STORM-2219) In HDFSBolt and SequenceFileBolt the files are overridden if they already exist

Yoel Cabo Lopez created STORM-2219:
--------------------------------------

             Summary: In HDFSBolt and SequenceFileBolt the files are overridden if they already exist
                 Key: STORM-2219
                 URL: https://issues.apache.org/jira/browse/STORM-2219
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-hdfs
            Reporter: Yoel Cabo Lopez
            Priority: Critical


Both bolts open their files in create mode, so a file that already exists is overwritten. If the bolt is restarted for any reason (a rebalance or a crash), the data already written is lost. I think this is especially serious. What's more, since the rotation number is stored only in memory and is lost on restart, all the files will eventually be wiped out.

I think there are two possible approaches:
- If the file already exists, open it in append mode. I see some problems here: (1) the tuple data written across the several rotations will not keep its order unless we jump to the last rotation, (2) the TimedRotationPolicy and other policies that rely on data stored in memory will not behave exactly as expected, and (3) in the case of the SequenceFileBolt, if the existing file has a different compression codec or type, it will raise an exception. Besides, we should change the way the HDFSWriter handles the write offset, because it depends on the size of the tuples being written and not on the size of the file (and that would affect the FileSizeRotationPolicy). This doesn't affect the SequenceFileWriter, since it uses the getLength() method of SequenceFile.Writer, which handles append mode properly.
- If the file exists, move to the next rotation. The problem I see is that if the rotation number is not part of the file name, it will enter an endless loop. Another issue is that if the restart of the bolt is caused by some problem that is not fixed after the restart, it could keep creating new files indefinitely until the NameNode collapses.
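
To make the two approaches concrete, here is a minimal sketch of each. It is not the storm-hdfs code: it uses java.nio.file on a local filesystem as a stand-in for HDFS, and the class name SafeOpenSketch, the method names, the "prefix-N.ext" naming scheme and the maxAttempts bound are all hypothetical. On HDFS the analogous calls would presumably be FileSystem.exists(), FileSystem.append() and FileSystem.create(path, false).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; the local filesystem stands in for HDFS.
class SafeOpenSketch {

    // Approach 1: append if the file exists, create it otherwise. Never truncates.
    static OutputStream openForAppend(Path file) throws IOException {
        return Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Approach 2: return the first rotation at or after `start` whose file does
    // not exist yet. The rotation number is always part of the name, and the
    // probe gives up after maxAttempts candidates instead of looping forever.
    static Path nextFreeRotation(Path dir, String prefix, String ext,
                                 int start, int maxAttempts) throws IOException {
        for (int i = 0; i < maxAttempts; i++) {
            Path candidate = dir.resolve(prefix + "-" + (start + i) + ext);
            if (!Files.exists(candidate)) {
                return candidate;
            }
        }
        throw new IOException("No free rotation found after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("storm-2219-sketch");
        Path first = nextFreeRotation(dir, "data", ".txt", 0, 100);
        try (OutputStream out = openForAppend(first)) {
            out.write("before restart\n".getBytes(StandardCharsets.UTF_8));
        }
        // Simulated restart: reopening the same file keeps the earlier bytes.
        try (OutputStream out = openForAppend(first)) {
            out.write("after restart\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.print(new String(Files.readAllBytes(first), StandardCharsets.UTF_8));
        // The existing rotation is skipped rather than overwritten.
        System.out.println(nextFreeRotation(dir, "data", ".txt", 0, 100).getFileName());
    }
}
```

The bound only turns an endless probe into an explicit failure. It does not address the ordering, rotation-policy or compression concerns listed for the append approach.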

I guess the solution will be a mix of both approaches, and I think I can implement it. But first I would like to ask whether anyone has any other concerns about it.

For now I have just written a bolt that satisfies my use case: Sequence Files are opened in append mode if the file exists, and rotation is based on size. But the solution should be more general.
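
The general shape of such a writer can be sketched as follows. This is not the bolt mentioned above: it writes plain bytes to a local filesystem through java.nio.file instead of Sequence Files on HDFS, and the class SizeRotatingAppender, its "prefix-N.dat" naming and its methods are hypothetical. It also assumes rotation numbers are contiguous. The point it illustrates is that the write offset is seeded from the size of the existing file, not only from the bytes written since startup, so a size-based rotation still triggers correctly after a restart.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: append on restart, rotate by actual file size.
class SizeRotatingAppender implements AutoCloseable {
    private final Path dir;
    private final String prefix;
    private final long maxBytes;
    private int rotation;      // index of the file currently being written
    private long offset;       // size of the current file in bytes
    private OutputStream out;

    SizeRotatingAppender(Path dir, String prefix, long maxBytes) throws IOException {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
        // Jump to the last existing rotation so data order is kept across restarts.
        while (Files.exists(fileFor(rotation + 1))) {
            rotation++;
        }
        open();
    }

    private Path fileFor(int n) {
        return dir.resolve(prefix + "-" + n + ".dat");
    }

    private void open() throws IOException {
        Path file = fileFor(rotation);
        // Seed the offset from the file on disk, not from an in-memory counter.
        offset = Files.exists(file) ? Files.size(file) : 0L;
        out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void write(byte[] record) throws IOException {
        // Rotate before a write that would push a non-empty file past the limit.
        if (offset > 0 && offset + record.length > maxBytes) {
            out.close();
            rotation++;
            open();
        }
        out.write(record);
        offset += record.length;
    }

    int currentRotation() { return rotation; }

    long currentOffset() { return offset; }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

A caller would create one appender per output directory, call write() for each serialized tuple, and close() it on shutdown. Creating a new appender over the same directory then resumes in the last rotation instead of overwriting it.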



