You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/07/19 07:05:36 UTC
[jira] [Commented] (KAFKA-350) Enable message replication in the presence of controlled failures

    [ https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418065#comment-13418065 ] 

Jay Kreps commented on KAFKA-350:
---------------------------------

This is a pretty hard to review due to the large number of changes and also because I don't know some of this code well.

A lot of things like bad logging/naming that I think you could probably catch just perusing the diff.

Log:
- Log should not know about hw right? We seem to be adding more hw stuff there?
- This adds a getHW() that just returns a private val, why not make the val public? Please fix these. Regardless of cleanup being done get/set methods have been against the style guide for a long time, lets not add more. Ditto getEndOffset() which in addition to being a getter is inconsistent with Log.logEndOffset
- There is debug statement in a for loop in Log.scala that needs to be removed
- I don't understand the difference between nextAppendOffset and logEndOffset. Can you make it clear in the javadoc and explain on why we need both of these. Our public interface to Log is getting incredibly complex, which is really sad so I think we should really think through deeply what is added here and why.
- The javadoc on line 138 of Log.scala doesn't match the style of javadoc for the preceeding 5 variables.

- Does making isr.keep.in.sync.time.ms more stringent actually make sense? 10 seconds is pretty tight. I think what you are saying is that every server bounce will introduce 30 seconds of latency. But I think that is kind of a weakness of our design. If we lower that timeout we may just get spurious dropped replicas, no?
- Can we change the name of isr.keep.in.sync.time.ms to replica.max.lag.time.ms?
- Good point about the socket timeouts. We can't set socket timeout equal to request timeout, though, as there may be a large network latency. I recommend we just default the socket timeout to something large (like 10x the request timeout), and throw an exception if it is less than the request timeout (since that is certainly wrong). I don't think we should be using the socket timeout except as an emergency escape for a totally hung broker now that we have the nice per-request timeout.
- Can we change producer.request.ack.timeout.ms to producer.request.timeout.ms so it is more intuitive? I don't think the word "ack" is going to be self-explanatory to users.

SocketServer.scala
- Please remove: info("Shut down acceptor completed")
- Is there a reason to add the port into the thread name? That seems extremely odd...is it to simplify testing where there are multiple servers on a machine?
- Why is it a bug for processNewResponses() to happen while a shutdown is occuring. I don't think that is a bug. That is called in the event loop. It is the loop that should stop, no? Is there any ill effect of this?
- Good catch on the infinite loop

System testing
- I think we should fix the performance test hacks. The performance tool is critical. I have watched this play out before. No one ever budgets time for making the performance test stuff usable and then it just gets re-written umpteen times and never does what is needed. Most of these are just a matter of some basic cleanup and adding options. Let's work clean.

AdminUtils
- I don't understand the change in getTopicMetaDataFromZK

Replica.scala
- Can you remove trace("Returning leader replica for leader " + leaderReplicaId) unless you think it is of value going forward

ErrorMapping.scala
- getMaxErrorCodeValue - seems to be recomputed for each TopicMetadata. Let's get rid of this, I don't think we need it. We already have an unknown entry in the mapping, we should use that and get rid of the Utils.getShortInRange
- If we do want to keep it, fix the name
- We should really give a base class KafkaException to all exceptions so the client can handle them more easily
- Instead of having the client get an IllegalArgumentException we should just throw new KafkaException("Unknown server error (error code " + code + ")")
- The file NoLeaderForPartitionException seems to be empty now, I think you meant to delete it

ConsoleConsumer
- What does moving those things into a finally block do? We just caught all exceptions...

FileMessageSet
- You added more log spam. Please format it so it is intelligible to someone not working on the code or remove: debug("flush size:" + sizeInBytes())
- Ditto info("recover upto size:" + sizeInBytes())

BrokerPartitionInfos is a really weird class

DefaultEventHandler
- This seems to have grown into a big glump of proceedural logic.
- Inconsistent spacing of parens should be cleaned up
- partitionAndCollate is extemely complex
- Option[Map[Int, Map[(String, Int), Seq[ProducerData[K,Message]]]]]

ZkUtils
- LeaderExists class needs a name that is a noun
- Also we have a class with the same name in TopicMetadata

I really like TestUtils.waitUntilLeaderIsElected. God bless you for not adding sleeps. We should consider repeating this pattern for other cases like this.

ZooKeeperTestHarness.scala
- Can we replace the thread.sleep with a waitUntilZkIsUp call?


                
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch
>
>
> KAFKA-46 introduced message replication feature in the absence of server failures. This JIRA will improve the log recovery logic and fix other bugs to enable message replication to happen in the presence of controlled server failures

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira