Posted to dev@flume.apache.org by "Mike Percy (JIRA)" <ji...@apache.org> on 2012/11/15 20:53:12 UTC

[jira] [Created] (FLUME-1714) Improve handling of HDFS sink .tmp files after crash

Mike Percy created FLUME-1714:
---------------------------------

             Summary: Improve handling of HDFS sink .tmp files after crash
                 Key: FLUME-1714
                 URL: https://issues.apache.org/jira/browse/FLUME-1714
             Project: Flume
          Issue Type: Improvement
            Reporter: Mike Percy


Currently, the .tmp files left after a system or Flume client crash are never cleaned up, and several users have noted that it would be better if Flume itself took care of this.

This is actually a complicated issue, with multiple facets. These include:
# We would need to persist the in-progress filenames somewhere, probably on the agent's local FS. This is not very hard.
# At startup, we would need to handle each leftover file in a way that guarantees at least one of the following:
** Mark the file as potentially partial in some way when renaming it from .tmp
** Ensure that the file format is valid before renaming it from .tmp
*** This second option is actually harder than it sounds, since arbitrary serializers may be plugged in. With an XML serializer, for example, we would need some way to programmatically read (deserialize) the file, throw away any potentially unfinished records at the end (this is OK since the transaction must not have been committed), then re-serialize the file with all the valid records and correct opening/closing tags.
*** General deserialization / recovery APIs would need to be added to support this, and they would need to be very carefully designed and implemented in order to work. In the end, if this turns out to be complex (and it sounds complex), most people would likely rely on out-of-the-box implementations (supported file formats) to get this functionality, unless they are building on top of abstract classes (e.g. for XML schema handling) to help accomplish this.
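
To make the moving parts above concrete, here is a rough sketch of the two steps: a local record of in-progress filenames, and a startup pass that either repairs a leftover .tmp file or renames it with a "partial" marker. It is written in Python for brevity and uses the local filesystem as a stand-in for HDFS. Every name in it (InProgressRegistry, recover, repair_line_delimited, the ".partial" suffix, the JSON registry file) is hypothetical and not part of Flume, and the repair function only handles a newline-delimited format, which is the easy case compared to something like XML.

```python
import json
import os


class InProgressRegistry:
    """Hypothetical local-FS record of files currently open as .tmp."""

    def __init__(self, path):
        self.path = path

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def _save(self, names):
        # Write a side file and swap it in atomically, so a crash in the
        # middle of a write cannot corrupt the registry itself.
        side = self.path + ".new"
        with open(side, "w") as f:
            json.dump(sorted(names), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(side, self.path)

    def add(self, name):
        self._save(set(self.pending()) | {name})

    def remove(self, name):
        self._save(set(self.pending()) - {name})


def repair_line_delimited(path):
    """Drop a trailing unterminated record from a newline-delimited file.

    Assumption: one record per line, so anything after the last newline
    is an unfinished (uncommitted) record. Reads the whole file, which
    is fine for a sketch but not for large files.
    """
    with open(path, "rb+") as f:
        data = f.read()
        end = data.rfind(b"\n") + 1  # 0 when no complete record exists
        f.truncate(end)
    return True


def recover(registry, repair=None):
    """Startup pass over files that were still .tmp at crash time.

    If `repair` is given and returns True, the file is renamed to its
    final name; otherwise it is renamed with a ".partial" marker.
    Returns a dict mapping each old name to its new name.
    """
    renamed = {}
    for tmp_name in registry.pending():
        if os.path.exists(tmp_name):
            base = tmp_name[:-4] if tmp_name.endswith(".tmp") else tmp_name
            ok = repair is not None and repair(tmp_name)
            final = base if ok else base + ".partial"
            os.rename(tmp_name, final)
            renamed[tmp_name] = final
        registry.remove(tmp_name)
    return renamed
```

In this sketch the sink would call add() when it opens a .tmp file and remove() after a normal close and rename, so that recover() only ever sees files that were open at the time of the crash. A per-format repair function is where the general deserialization / recovery API would have to plug in.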
