Posted to dev@kafka.apache.org by "Swapnil Ghike (JIRA)" <ji...@apache.org> on 2012/08/21 04:18:37 UTC

[jira] [Created] (KAFKA-475) Time based log segment rollout

Swapnil Ghike created KAFKA-475:
-----------------------------------

             Summary: Time based log segment rollout
                 Key: KAFKA-475
                 URL: https://issues.apache.org/jira/browse/KAFKA-475
             Project: Kafka
          Issue Type: New Feature
    Affects Versions: 0.7.1
            Reporter: Swapnil Ghike
            Assignee: Swapnil Ghike
             Fix For: 0.7.2


Some applications might want their data to be deleted from the Kafka servers earlier than the default retention time. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Comment Edited] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439893#comment-13439893 ] 

Swapnil Ghike edited comment on KAFKA-475 at 8/23/12 8:58 AM:
--------------------------------------------------------------

Patch attached:
1. Time based log segment rollout added. As discussed with Neha, the values of config.logRollHours and config.logRetentionHours are decoupled now.

2. Moved the position of maybeRoll(segment) call in the Log to make sure that a new message does not get appended to a segment that has expired in time. 
   i. Accordingly modified the testCleanupSegmentsToMaintainSizeWithSizeBasedLogRoll

3. I have currently set the range of logRetentionHours and logRollHours to (1, 24 * 7). An upper cap on the value of hours is necessary because a very high value of hours can overflow and become negative when converted to milliseconds. 

4. Unit tests added in LogTest 
    i. testTimeBasedLogRoll
    ii. testSizeBasedLogRoll

5. Unit tests added in LogManagerTest (sorry couldn't come up with more concise names :\ )
    i. testCleanupSegmentsToMaintainSizeWithTimeBasedLogRoll
    ii. testCleanupExpiredSegmentsWithTimeBasedLogRoll
                
                  
> Time based log segment rollout
> ------------------------------
>
>                 Key: KAFKA-475
>                 URL: https://issues.apache.org/jira/browse/KAFKA-475
>             Project: Kafka
>          Issue Type: New Feature
>    Affects Versions: 0.7.1
>            Reporter: Swapnil Ghike
>            Assignee: Swapnil Ghike
>              Labels: features
>             Fix For: 0.7.2
>
>         Attachments: kafka-475-v1.patch, kafka-475-v2.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some applications might want their data to be deleted from the Kafka servers earlier than the default retention time. 


        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440559#comment-13440559 ] 

Jun Rao commented on KAFKA-475:
-------------------------------

1. Yes, adding a topic level log retention size will be useful.

3. Yes, we can make timeOfCreation an Option. Initially, it will be None. It becomes a non-empty value on the next append.

4. It doesn't seem that log rolling interferes with log cleanup. So, removing those tests should be fine.
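
The Option idea in point 3 can be sketched as follows. This is a minimal Python illustration, not Kafka's Scala code: the class name Segment and the fields first_append_ms and size_bytes are hypothetical, and Python's None stands in for Scala's None.

```python
from typing import Optional


class Segment:
    """Hypothetical stand-in for a log segment; not Kafka's LogSegment."""

    def __init__(self) -> None:
        self.size_bytes = 0
        # Unset until the first message arrives (Scala: an Option that is None).
        self.first_append_ms: Optional[int] = None

    def append(self, message: bytes, now_ms: int) -> None:
        # The clock for time based rolling starts at the first append,
        # so later appends leave the timestamp untouched.
        if self.first_append_ms is None:
            self.first_append_ms = now_ms
        self.size_bytes += len(message)

    def expired(self, now_ms: int, roll_ms: int) -> bool:
        # A segment that has never been appended to cannot expire.
        if self.first_append_ms is None:
            return False
        return now_ms - self.first_append_ms >= roll_ms
```

With this shape, an empty segment never triggers a roll, which matches the policy discussed in this thread of not rolling a new segment while the current one is empty.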


                

        

[jira] [Updated] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-475:
--------------------------------

    Status: Patch Available  (was: In Progress)
    

        

[jira] [Updated] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-475:
--------------------------------

    Attachment: kafka-475-v2.patch

Patch attached:
1. Time based log segment rollout added.

2. Moved the position of maybeRoll(segment) call in the Log to make sure that a new message does not get appended to a segment that has expired in time. 
   i. Accordingly modified the testCleanupSegmentsToMaintainSizeWithSizeBasedLogRoll

3. I have currently set the range of logRetentionHours and logRollHours to (1, 24 * 7). An upper cap on the value of hours is necessary because a very high value of hours can overflow and become negative when converted to milliseconds. 

4. Unit tests added in LogTest 
    i. testTimeBasedLogRoll
    ii. testSizeBasedLogRoll

5. Unit tests added in LogManagerTest (sorry couldn't come up with more concise names :\ )
    i. testCleanupSegmentsToMaintainSizeWithTimeBasedLogRoll
    ii. testCleanupExpiredSegmentsWithTimeBasedLogRoll
                

        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438975#comment-13438975 ] 

Swapnil Ghike commented on KAFKA-475:
-------------------------------------

Jun: Thanks for pointing out the mistake. I could not see why (a) in your suggestions is important, though. Could you please elaborate on whether it makes a difference if we do not implement (a)?

Neha: Please correct me if I failed to see your point. In the proposed scheme, a new segment is rolled out when either the size limit or the time limit is hit, whichever comes first. So, if a producer produces data fast enough, it can still create multiple segments due to the size limit on each segment. I have set the rolling interval equal to the retention interval. In this case, if the segments don't hit the size limit within the retention time (due to an aggressive retention time or slow production of data), then what you said will be true and there will be at most two active segments in the log at any point in time. In the first case, the application indeed wanted its data cleaned up fast, and in the second case, hopefully the number of segments should not matter.
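
The "whichever limit is hit first" rule can be illustrated with a small simulation. This is a rough Python sketch, not the patch itself: should_roll and count_rolls are hypothetical names, the producer is modeled as writing a fixed number of bytes once per second, and the use of >= for both comparisons is an assumption.

```python
def should_roll(segment_bytes: int, segment_age_ms: int,
                max_bytes: int, roll_ms: int) -> bool:
    """Roll when either the size limit or the time limit is reached."""
    return segment_bytes >= max_bytes or segment_age_ms >= roll_ms


def count_rolls(bytes_per_sec: int, duration_sec: int,
                max_bytes: int, roll_sec: int) -> int:
    """Count the rolls triggered by a steady producer over duration_sec."""
    rolls = 0
    size = 0
    age_sec = 0
    for _ in range(duration_sec):
        size += bytes_per_sec
        age_sec += 1
        if should_roll(size, age_sec * 1000, max_bytes, roll_sec * 1000):
            rolls += 1
            size = 0
            age_sec = 0
    return rolls


# Fast producer: the size limit triggers every roll.
fast = count_rolls(bytes_per_sec=100, duration_sec=60, max_bytes=500, roll_sec=60)
# Slow producer: the time limit triggers every roll.
slow = count_rolls(bytes_per_sec=1, duration_sec=120, max_bytes=500, roll_sec=60)
```

In this toy model the fast producer fills a 500-byte segment every 5 seconds, so the time limit never comes into play, while the slow producer only ever rolls on the 60-second boundary.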

Including your other suggestions in the patch.
                

        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440342#comment-13440342 ] 

Jun Rao commented on KAFKA-475:
-------------------------------

Thanks for patch v3. A few other comments:
20. KafkaConfig:
20.1 To be consistent, we probably should add topic level log file size for rolling.
20.2 We probably don't need to cap logRoll and logRetention hours at 24*7 since we store ms in a long, which can hold up to 2^63 - 1 milliseconds.

21. LogSegment: Unlike Java, we can just have "val startTime" and use it directly. Scala already wraps the val with a public getter.

22. LogManagerTest: It seems to me that we can test log rolling (covered in LogTest) and log cleanup (covered in LogManagerTest) independently. Is there any value in testing all 4 combinations of log rolling and log cleanup?
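
The arithmetic behind point 20.2 can be checked with a short sketch. It assumes the overflow mentioned earlier in the thread comes from doing the hours-to-milliseconds conversion in signed 32-bit arithmetic, which the thread implies but does not state. Python integers do not overflow, so wrap_int32 is a hypothetical helper that simulates the wraparound.

```python
MS_PER_HOUR = 60 * 60 * 1000
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


def hours_to_ms_int32(hours: int) -> int:
    """Convert hours to milliseconds as a 32-bit multiplication would."""
    return wrap_int32(hours * MS_PER_HOUR)


one_week_ms = hours_to_ms_int32(24 * 7)      # still fits in 32 bits
max_safe_hours_32 = INT32_MAX // MS_PER_HOUR  # largest hour count that fits
overflowed = hours_to_ms_int32(max_safe_hours_32 + 1)  # wraps to a negative value
max_safe_hours_64 = INT64_MAX // MS_PER_HOUR  # limit when ms is held in a 64-bit long
```

Under this assumption the 32-bit limit is reached after a few hundred hours, whereas a 64-bit long leaves room for trillions of hours, which is why the cap becomes unnecessary once the multiplication is done in long.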

                

        

[jira] [Updated] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-475:
--------------------------------

    Attachment: kafka-475-v3.patch

Removed an unnecessary assert statement. Please view v3 of patch.
                

        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440519#comment-13440519 ] 

Swapnil Ghike commented on KAFKA-475:
-------------------------------------

1. Similarly, should we also add topic level log retention size?
2. Ok.
3. Ok. I am actually changing it to a var because there is one small change to be made to the rolling policy: we don't roll a new log when the previous segment, which has expired in time, is empty. When a new message is finally appended to this empty expired segment, its timeOfCreation should also be reset to a new value.
4. I implemented the new tests to make sure that the independent mechanisms of roll and recovery don't interfere with each other. But now that I look at them, they indeed look like a working module of rolling followed by a working module of recovery. We can either remove them, or I can try to combine all modes of roll and recovery in one new test to check for any interference.

Also, should we have a check for illegal values in getTopic* methods in Utils?
                

        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Neha Narkhede (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438911#comment-13438911 ] 

Neha Narkhede commented on KAFKA-475:
-------------------------------------

If you roll log segments based on retention time, it seems you can have only one segment for that log at any point in time. If you want to roll 5-minute segments, you can only have 5 minutes' worth of data for that partition. In contrast, if I choose size based rolling and size based retention, I can have multiple log segments, each of a specific size. What seems desirable is for time based rolling + retention to behave the same way. I would imagine applications wanting to roll segments every hour and retain 24 hours' worth of data. This is an advantage for applications using getOffsetsBefore() to do a time-indexed fetch of the data, since getOffsetsBefore only returns offsets at log segment granularity. It also gives applications a way to reason about the time window of the data retained for a partition. One potential downside is that you can end up creating a large number of log segments for your partition if you choose too small a value for log.file.time.ms. But this problem exists today with size based log segment rolling too, so we are not introducing any regression in behavior.
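
To make the granularity argument concrete, here is a toy Python model of a time-indexed lookup that can only resolve to segment boundaries. It is not how getOffsetsBefore is implemented; segment_starts and latest_segment_at_or_before are hypothetical helpers, and segments are represented only by their start times in milliseconds.

```python
from bisect import bisect_right
from typing import List

HOUR_MS = 60 * 60 * 1000


def segment_starts(retention_hours: int, roll_hours: int) -> List[int]:
    """Start times (ms) of segments rolled every roll_hours over the window."""
    return [h * HOUR_MS for h in range(0, retention_hours, roll_hours)]


def latest_segment_at_or_before(starts: List[int], target_ms: int) -> int:
    """Return the start time of the last segment beginning at or before target_ms."""
    idx = bisect_right(starts, target_ms) - 1
    if idx < 0:
        raise ValueError("target is earlier than the oldest retained segment")
    return starts[idx]


target = int(5.5 * HOUR_MS)

# Roll hourly, retain 24 hours: the lookup lands within an hour of the target.
hourly = segment_starts(retention_hours=24, roll_hours=1)
hourly_hit = latest_segment_at_or_before(hourly, target)

# Roll interval equal to retention: a single segment, so the lookup is coarse.
single = segment_starts(retention_hours=24, roll_hours=24)
single_hit = latest_segment_at_or_before(single, target)
```

In this model the lookup error is bounded by the roll interval: half an hour off with hourly rolling, five and a half hours off when the only segment spans the whole retention window.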

Other review comments -

1. Log
1.1 Rename currentMS to currentMs (follow the camel case convention).
1.2 How about renaming retentionMSInterval to retentionIntervalMs to be consistent with the naming convention?
1.3 In maybeRoll, it looks like currentMS is used only to compute the time difference. How about removing currentMS?

2. LogManager
2.1 This is unrelated to your patch, but let's also rename logRetentionMSMap to logRetentionMsMap.


                

       

[jira] [Updated] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-475:
--------------------------------

    Attachment: kafka-475-v4.patch

1. Topic level log roll size and retention size limits added. 
2. Removed the cap on logRoll and logRetention Hours. 
3. Created an Option for the timeOfFirstAppend. 
4. Removed the unnecessary unit tests. 

Created KAFKA-481 for adding require() to getTopic* methods.
                

        

[jira] [Updated] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-475:
--------------------------------

    Attachment: kafka-475-v1.patch

To facilitate this, we can roll out a new log segment whenever a time threshold is reached, if the size limit has not been reached already. We can set the time limit for segment rollout to be the same as the retention time limit. These values ensure that the number of open file handles in the system at any point cannot more than double.
                

        

[jira] [Commented] (KAFKA-475) Time based log segment rollout

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438789#comment-13438789 ] 

Jun Rao commented on KAFKA-475:
-------------------------------

Thanks for patch v1. Some comments:

1. The condition for testing whether we should roll a new log segment doesn't seem right. Currently, it will roll a new segment if the last segment hasn't been updated for the retention time. What we should do is roll a new segment every retention interval, independent of the last update time, as long as (a) no segment has been rolled since the last retention interval and (b) the last segment has a size larger than 0.

2. We should add a unit test to test rolling a new segment by time. 
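
The difference between the two conditions in point 1 can be sketched in Python. Both functions are hypothetical illustrations rather than code from the patch: the first models the check based on the last update time, the second models the suggested check based on the age of the current segment, with (a) read as "the current segment was rolled at least one interval ago".

```python
def should_roll_by_last_update(now_ms: int, last_update_ms: int,
                               interval_ms: int) -> bool:
    """Check described as incorrect: idle time since the last write."""
    return now_ms - last_update_ms >= interval_ms


def should_roll_by_creation(now_ms: int, created_ms: int,
                            size_bytes: int, interval_ms: int) -> bool:
    """Suggested check: segment age reached the interval and it is non-empty."""
    return size_bytes > 0 and now_ms - created_ms >= interval_ms


INTERVAL_MS = 60_000

# A segment created at t=0 that receives a write every second.
# At t=120s its last update was 1s ago, so the idle-time check never fires.
steady_idle_check = should_roll_by_last_update(120_000, 119_000, INTERVAL_MS)
steady_age_check = should_roll_by_creation(120_000, 0, 4096, INTERVAL_MS)

# An empty segment is never rolled, however old it is.
empty_age_check = should_roll_by_creation(120_000, 0, 0, INTERVAL_MS)
```

Under steady traffic the idle-time check would never roll a segment by time, which is the behavior the review comment objects to.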
                

        

[jira] [Work started] (KAFKA-475) Time based log segment rollout

Posted by "Swapnil Ghike (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on KAFKA-475 started by Swapnil Ghike.
