Posted to dev@chukwa.apache.org by "Ahmed Fathalla (JIRA)" <ji...@apache.org> on 2010/04/12 19:27:51 UTC

[jira] Updated: (CHUKWA-4) Collectors don't finish writing .done datasink from last .chukwa datasink when stopped using bin/stop-collectors

     [ https://issues.apache.org/jira/browse/CHUKWA-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Fathalla updated CHUKWA-4:
--------------------------------

    Attachment: CHUKWA-4.patch

This patch contains a fix for corrupt sink files created locally. I've added a new class, CopySequenceFile, which copies a corrupt .chukwa file, record by record, into a valid .done file.
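For reference, the core of the copy is the standard record-by-record SequenceFile copy. This is a simplified sketch of the approach, not the exact code in the attached patch (only the class and method names mirror the stack trace below; everything else is illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CopySequenceFile {
  public static void createValidSequenceFile(Configuration conf,
                                             String dir,
                                             String chukwaFileName)
      throws IOException {
    FileSystem localFs = FileSystem.getLocal(conf);
    Path chukwaFile = new Path(dir, chukwaFileName);
    // foo.chukwa -> foo.done
    Path doneFile = new Path(dir, chukwaFileName.replace(".chukwa", ".done"));

    SequenceFile.Reader reader =
        new SequenceFile.Reader(localFs, chukwaFile, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(localFs, conf, doneFile,
                                  reader.getKeyClass(), reader.getValueClass());

    Writable key = (Writable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);

    // Re-append every complete record. The writer emits a fresh header and
    // sync marks, so the .done file is a well-formed SequenceFile even
    // though the source was cut off mid-write.
    while (reader.next(key, value)) {
      writer.append(key, value);
    }
    writer.close();
    reader.close();

    localFs.delete(chukwaFile, false); // drop the corrupt original
  }
}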

The code for recovering a failed copy attempt is included in the cleanup() method of LocalToRemoteHdfsMover and follows Jerome's suggestions. I have also added a unit test that creates a sink file, converts it into a .done file, and verifies that the .done file was created and the .chukwa file removed.
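In outline, the recovery pass in cleanup() does something like the following (again a sketch, not the patch itself; localFs, conf, and localOutputDir are assumed fields, and FileStatus/PathFilter come from org.apache.hadoop.fs):

// Inside cleanup(), before the normal shutdown path: any *.chukwa file
// still sitting in the local sink directory is left over from an
// interrupted run, so convert each one into a valid .done file.
FileStatus[] leftovers = localFs.listStatus(new Path(localOutputDir),
    new PathFilter() {
      public boolean accept(Path path) {
        return path.getName().endsWith(".chukwa");
      }
    });
for (FileStatus status : leftovers) {
  CopySequenceFile.createValidSequenceFile(conf, localOutputDir,
                                           status.getPath().getName());
}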

I have tested this solution several times and it appears to work. However, I have hit a rare case where recovery fails with the following exception while reading from the .chukwa file and writing to the .done file:


2010-04-12 07:56:47,538 WARN LocalToRemoteHdfsMover CopySequenceFile - Error during .chukwa file recovery
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
	at org.apache.hadoop.chukwa.util.CopySequenceFile.createValidSequenceFile(CopySequenceFile.java:80)
	at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.cleanup(LocalToRemoteHdfsMover.java:185)
	at org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalToRemoteHdfsMover.run(LocalToRemoteHdfsMover.java:215)


This seemed to happen when recovering from a .chukwa file that had been created just before the collector crashed (the file was only ~200 KB), so my guess is that it contains no complete records and should simply be removed. I would appreciate it if you could point out how we should deal with this situation.
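One way we could handle it (an untested sketch for discussion, reusing the names from the copy sketch above; log and java.io.EOFException are assumed): wrap the copy loop, keep whatever complete records were already written, and delete an empty .done file outright:

// Hypothetical variant of the copy loop in CopySequenceFile: tolerate a
// truncated final record instead of failing the whole recovery.
long recordsCopied = 0;
try {
  while (reader.next(key, value)) {
    writer.append(key, value);
    recordsCopied++;
  }
} catch (java.io.EOFException e) {
  // The last record was cut off mid-write; everything before it is intact.
  log.warn("truncated record at end of " + chukwaFile + ", kept "
      + recordsCopied + " complete records", e);
} finally {
  writer.close();
  reader.close();
}
if (recordsCopied == 0) {
  // A header-only file written just before the crash has nothing worth
  // keeping, so remove the empty .done as well.
  localFs.delete(doneFile, false);
}
localFs.delete(chukwaFile, false);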

> Collectors don't finish writing .done datasink from last .chukwa datasink when stopped using bin/stop-collectors
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CHUKWA-4
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-4
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>         Environment: I am running on our local cluster, a Linux machine that I also run a Hadoop cluster from.
>            Reporter: Andy Konwinski
>            Priority: Minor
>         Attachments: CHUKWA-4.patch
>
>
> When I use start-collectors, it creates the datasink as expected and writes to it as normal, i.e. writes to the .chukwa file, and rollovers work fine, renaming the .chukwa file to .done. However, when I use bin/stop-collectors to shut down the running collector, it leaves a .chukwa file in HDFS. I'm not sure whether that leftover file is a valid sink, but I think the collector should clean up the datasink gracefully and rename it to .done before exiting.
