Posted to dev@kafka.apache.org by Jaikiran Pai <ja...@gmail.com> on 2015/01/29 07:26:54 UTC

Changing the default Kafka data log directory (was Re: What to do when file.rename fails?)

I have created a JIRA for this proposed change 
https://issues.apache.org/jira/browse/KAFKA-1906 and uploaded a patch 
for review https://reviews.apache.org/r/30403/

-Jaikiran

On Tuesday 27 January 2015 02:36 AM, Jay Kreps wrote:
> Having a relative path and keeping data under /data in the kafka distro
> would make sense. This would require some reworking of the shell scripts,
> though, as I think right now you can actually run Kafka from any directory
> and the cwd of the process will be whatever directory you start from. If we
> have a relative path in the config then the working directory will HAVE to
> be the kafka directory. This works for the simple download case but may
> make some packaging stuff harder for other use cases.
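A minimal sketch of the working-directory issue described above, assuming a hypothetical relative value such as log.dirs=data/kafka-logs (the actual value proposed in the patch is not shown here): java.io.File resolves a relative path against the process's current working directory, so the same config would point at different locations depending on where the broker is started from.

    import java.io.File;

    public class RelativeLogDirDemo {
        public static void main(String[] args) {
            // Hypothetical relative log.dirs value, for illustration only.
            final String configuredLogDir = "data/kafka-logs";
            // java.io.File resolves relative paths against the current working
            // directory ("user.dir"), not against the Kafka install directory.
            final File resolved = new File(configuredLogDir);
            System.out.println("cwd = " + System.getProperty("user.dir"));
            System.out.println("log dir resolves to " + resolved.getAbsolutePath());
            // Started from /opt/kafka this prints /opt/kafka/data/kafka-logs;
            // started from /home/someuser it prints /home/someuser/data/kafka-logs.
        }
    }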
>
> -Jay
>
> On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <ja...@gmail.com>
> wrote:
>
>> Having looked at the logs the user posted, I don't think this specific
>> issue has to do with the /tmp path.
>>
>> However, now that the /tmp path is being discussed, I think it's a good
>> idea that we default the Kafka logs to a certain folder. As Jay notes, it
>> makes it very easy to just download and start the servers without having to
>> fiddle with the configs when you are just starting out. Having said that,
>> when I started out with Kafka, I found /tmp to be an odd place to default
>> the path to. I expected it to default to a folder within the Kafka
>> install, somewhere like a KAFKA_INSTALL_FOLDER/data/kafka-logs/ folder. Is
>> that something we should do?
>>
>> -Jaikiran
>>
>> On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:
>>
>>> Hmm, but I don't think tmp gets cleaned while the server is running...
>>>
>>> The reason for using tmp was because we don't know which directory they
>>> will use and we don't want them to have to edit configuration for the
>>> simple "out of the box" getting started tutorial. I actually do think that
>>> is important. Maybe an intermediate step we could do is just call out this
>>> setting in the quickstart so people know where data is going and know they
>>> need to configure it later...
>>>
>>> -Jay
>>>
>>> On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <jo...@stealth.ly> wrote:
>>>
>>>> This feels like another type of symptom from people using /tmp/ for their
>>>> logs. Personally, I would rather use /mnt/data or something, and if that
>>>> doesn't exist on their machine we can throw an exception, or have no
>>>> default and force users to set it.
>>>>
>>>> /*******************************************
>>>> Joe Stein
>>>> Founder, Principal Consultant
>>>> Big Data Open Source Security LLC
>>>> http://www.stealth.ly
>>>> Twitter: @allthingshadoop
>>>> ********************************************/
>>>> On Jan 25, 2015 11:37 AM, "Jay Kreps" <ja...@gmail.com> wrote:
>>>>
>>>>> I think you are right, good catch. It could be that this user deleted the
>>>>> files manually, but I wonder if there isn't some way that is a Kafka
>>>>> bug--e.g. if multiple types of retention policies kick in at the same
>>>>> time, do we synchronize that properly?
>>>>>
>>>>> -Jay
>>>>>
>>>>> On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2013@gmail.com> wrote:
>>>>>
>>>>>   Hi Jay,
>>>>>> I spent some more time over this today and went back to the original
>>>>>> thread which brought up the issue with file leaks [1]. I think the
>>>>>> output of lsof in those logs has a very important hint:
>>>>>>
>>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364
>>>>>>
>>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368
>>>>>>
>>>>>> Notice the "(deleted)" text in that output. The last time I looked at that
>>>>>> output, I thought it was the user who had added the "deleted" text to help
>>>>>> us understand the problem. But today I read up on the output format of
>>>>>> lsof, and it turns out that it's lsof itself which adds that hint whenever
>>>>>> a file has already been deleted, possibly by a different process, but some
>>>>>> other process is still holding on to open resources of that (deleted) file
>>>>>> [2].
>>>>>>
>>>>>> So in the context of the issue that we are discussing and the way Kafka
>>>>>> deals with async deletes (i.e. first by attempting a rename of the
>>>>>> log/index files), I think this all makes sense now. So what I think is
>>>>>> happening is: some (other?) process (not sure what/why) has already deleted
>>>>>> the log file that Kafka is using for the LogSegment. The LogSegment however
>>>>>> still has an open FileChannel resource on that deleted file (and that's why
>>>>>> the open file descriptor is held on to and shows up in that output). Now
>>>>>> Kafka, at some point in time, triggers an async delete of the LogSegment,
>>>>>> which involves a file rename of that (already deleted) log file. The rename
>>>>>> fails (because the original file path isn't there anymore). As a result, we
>>>>>> end up throwing that "failed to rename" KafkaStorageException and thus
>>>>>> leave behind the open FileChannel to continue being open forever (till the
>>>>>> Kafka program exits).
>>>>>>
>>>>>> So I think we should:
>>>>>>
>>>>>> 1) Find out what deletes that underlying log file(s), and why. I'll add a
>>>>>> reply to that original mail discussion asking the user if he can provide
>>>>>> more details.
>>>>>> 2) Handle this case and close the FileChannel. The patch that's been
>>>>>> uploaded to review board https://reviews.apache.org/r/29755/ does that.
>>>>>> The "immediate delete" on failure to rename involves (safely) closing the
>>>>>> open FileChannel and (safely) deleting the (possibly non-existent) file.
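A rough sketch of that handling in plain Java (not the actual Scala code from the review board patch; the class, method name and structure here are illustrative only):

    import java.io.File;
    import java.io.IOException;
    import java.nio.channels.FileChannel;

    public class ImmediateDeleteSketch {
        // Illustrative only: on rename failure, close the channel and delete
        // whatever is left of the file instead of leaving the FileChannel open.
        static void renameOrDeleteImmediately(File logFile, FileChannel channel) throws IOException {
            final File renamed = new File(logFile.getPath() + ".deleted");
            if (!logFile.renameTo(renamed)) {
                // Rename failed (e.g. the file was already deleted externally):
                // release the open file descriptor so it is not held forever...
                channel.close();
                // ...and attempt an immediate delete; File.delete() simply
                // returns false if the file no longer exists, which is fine here.
                logFile.delete();
            }
        }
    }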
>>>>>> By the way, this entire thing can be easily reproduced by running the
>>>>>> following program, which first creates a file and opens a FileChannel to
>>>>>> that file, then waits for the user to delete that file externally (I used
>>>>>> the rm command), and then tries to rename that deleted file, which then
>>>>>> fails. In between each of these steps, you can run the lsof command
>>>>>> externally to see the open file resources (I used 'lsof | grep test.log'):
>>>>>>
>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>           // Open a file and a file channel for read/write
>>>>>>           // (change this path relevantly if you plan to run it)
>>>>>>           final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
>>>>>>           final FileChannel fileChannel = new RandomAccessFile(originalLogFile, "rw").getChannel();
>>>>>>           System.out.println("Opened file channel to " + originalLogFile);
>>>>>>           // wait for underlying file to be deleted externally
>>>>>>           System.out.println("Waiting for the " + originalLogFile + " to be deleted externally. Press any key after the file is deleted");
>>>>>>           System.in.read();
>>>>>>           // wait for the user to check the lsof output
>>>>>>           System.out.println(originalLogFile + " seems to have been deleted externally, check lsof command output to see open file resources.");
>>>>>>           System.out.println("Press any key to try renaming this already deleted file, from the program");
>>>>>>           System.in.read();
>>>>>>           // try rename
>>>>>>           final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
>>>>>>           System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
>>>>>>           final boolean renamedSucceeded = originalLogFile.renameTo(fileToRenameTo);
>>>>>>           if (renamedSucceeded) {
>>>>>>               System.out.println("Rename SUCCEEDED. Renamed file exists? " + fileToRenameTo.exists());
>>>>>>           } else {
>>>>>>               System.out.println("FAILED to rename file " + originalLogFile + " to " + fileToRenameTo);
>>>>>>           }
>>>>>>           // wait for the user to check the lsof output, after our rename failed
>>>>>>           System.out.println("Check the lsof output and press any key to close the open file channel to a deleted file");
>>>>>>           System.in.read();
>>>>>>           // close the file channel
>>>>>>           fileChannel.close();
>>>>>>           // let user check the lsof output one final time. This time they won't see open file resources from this program
>>>>>>           System.out.println("File channel closed. Check the lsof output and press any key to terminate the program");
>>>>>>           System.in.read();
>>>>>>           // all done, exit
>>>>>>           System.out.println("Program will terminate");
>>>>>>       }
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
>>>>>> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
>>>>>>
>>>>>>
>>>>>> -Jaikiran
>>>>>>
>>>>>> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures
>>>>>>> during delete: https://issues.apache.org/jira/browse/KAFKA-1853
>>>>>>>
>>>>>>> The core problem here is that we are doing File.rename() as part of the
>>>>>>> delete sequence, which returns false if the rename failed. Our file delete
>>>>>>> sequence is something like the following:
>>>>>>>
>>>>>>> 1. Remove the file from the index so no new reads can begin on it
>>>>>>> 2. Rename the file to xyz.deleted so that if we crash it will get cleaned up
>>>>>>> 3. Schedule a task to delete the file in 30 seconds or so when any
>>>>>>> in-progress reads have likely completed. The goal here is to avoid errors
>>>>>>> on in-progress reads but also avoid locking on all reads.
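A rough sketch of steps 2 and 3 of that sequence in plain Java (illustrative only, not Kafka's actual Scala implementation; the class name and single-threaded scheduler are stand-ins):

    import java.io.File;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class AsyncDeleteSketch {
        private static final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Rename the segment file to *.deleted, then schedule the real delete
        // for later so that any in-progress reads can finish without taking a
        // lock on all reads.
        static void asyncDelete(File segmentFile, long delaySeconds) {
            final File marked = new File(segmentFile.getPath() + ".deleted");
            if (!segmentFile.renameTo(marked)) {
                // This is exactly the failure case being discussed:
                // renameTo() just returns false, it does not say why.
                throw new RuntimeException("Failed to rename " + segmentFile + " to " + marked);
            }
            scheduler.schedule(() -> { marked.delete(); }, delaySeconds, TimeUnit.SECONDS);
        }
    }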
>>>>>>> The question is what to do when rename fails? Previously, if this happened
>>>>>>> we actually didn't pay attention and would fail to delete the file
>>>>>>> entirely. This patch changes it so that if the rename fails we log an error
>>>>>>> and force an immediate delete.
>>>>>>>
>>>>>>> I think this is the right thing to do, but I guess the real question is:
>>>>>>> why would rename fail? Some possibilities:
>>>>>>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
>>>>>>>
>>>>>>> An alternative would be to treat this as a filesystem error and shut down
>>>>>>> as we do elsewhere.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> -Jay
>>>>>>>
>>>>>>>
>>>>>>>


Re: Changing the default Kafka data log directory (was Re: What to do when file.rename fails?)

Posted by Darion Yaphet <da...@gmail.com>.
It seems like a good idea. Using /tmp/kafka-logs as the default dir is
unsafe; it may be deleted by other people.

2015-01-29 14:26 GMT+08:00 Jaikiran Pai <ja...@gmail.com>:

> I have created a JIRA for this proposed change
> https://issues.apache.org/jira/browse/KAFKA-1906 and uploaded a patch for
> review https://reviews.apache.org/r/30403/
>
> -Jaikiran


-- 

long is the way and hard  that out of Hell leads up to light