Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2014/01/24 05:54:38 UTC

[jira] [Comment Edited] (KAFKA-1227) Code dump of new producer

    [ https://issues.apache.org/jira/browse/KAFKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880717#comment-13880717 ] 

Jay Kreps edited comment on KAFKA-1227 at 1/24/14 4:54 AM:
-----------------------------------------------------------

Hey Jun, quick responses. I'll try to get a patch up with some of the minor things, though a few of the others I'd like to do post-commit.

1. WRT closing connections: yes, this is true. I agree it is needed but decided to punt on it for the first pass. It is an important follow-up item. There are two cases to handle: metadata fetches and leadership handoffs.
Obviously the Selector will not handle these special cases, which are specific to this use case. Theoretically this could all be done in the Sender logic, but it would be a bit complex. I think the best solution is just to have us time out idle connections after some configurable period of disuse (30 seconds, say); there is a rough sketch of what I mean at the end of this comment.

2. I think the VIP problem can be handled by just timing out idle connections. Special cases related to metadata won't help, because non-metadata connections can also sit idle. Not retaining the bootstrap urls is intentional: future metadata requests should use the full broker set the producer is connecting to. You mention that this will cause the producer to prefer fetching metadata from a broker to which it already has a connection for subsequent metadata fetches after the initial bootstrap, but that was the idea--no need to set up and then time out another connection if we already have one.

3. WRT initializing the partitioner to 0: yeah, we can initialize to something random (see the sketch at the end of this comment). This problem would be a bit pathological, as you would have to start all your producers at the same instant and send exactly the same number of messages through each of them for it to persist.

4. I included numPartitions even though it is easily computable from the cluster because all partitioners will need to mod by the number of partitions, while the vast majority won't otherwise need the cluster at all. It seemed more intuitive to hand it to them than to make the user figure out that they can get it by calling into the cluster and then worry about the underlying performance of that call. There is a small illustration at the end of this comment.

5.1 Yes, that is a bug.
5.2 It is a bit slangy :-)

6. I don't prevent this: a zero hash code will be recomputed each time, but that is an unlikely case, and recomputing is exactly what would happen for every value if we didn't cache at all (see the sketch at the end of this comment).

7. Good point, I'll improve the error message.

8.1 I'll try to think of a better name.
8.2 Yes, we can do that. I think that would be good for latency in the case where we had to allocate a non-standard size.

9. I think you could argue either way in terms of the preferable sequencing. However, I wanted to reuse the RecordSend object as the argument to the callback rather than introduce another object, which means I do need to complete the record send first; otherwise the callback would block trying to access fields in the send. There is a small sketch of the ordering at the end of this comment.

10. Ah, very good point.

11. Thanks.

12. I am not too picky. 2 spaces is the recommended style in Scala and 4 spaces is the classic "Sun Java style". I would like to get our style formally specified in an IDE formatter. That is what I am using for Eclipse, and it is very nice: it does all the formatting for you and ensures a very consistent style. I will start a thread on this one, as likely everyone has an opinion.

13. I had the same thought. I'm not sure if it is better to give the current value or the average over the window (a sketch of the difference is at the end of this comment). Thoughts? Since we mostly look at graphs polled every 30 seconds, the instantaneous measurement amounts to just a single data point for the whole 30 seconds, but that may be okay...

14. I figured we would need to discuss logging, so I just punted for now. The standard, insofar as there is one, is really slf4j, which I consider kind of silly. There are really just a couple of places that need logging, so maybe it would be fine to just use java.util.logging, which comes with the JVM (example at the end of this comment). I'll start a thread on this.

15. In a client I think this is something we should leave to the client. Printing lots of messages in their logs is a bit rude. I think it is better to give an API to get information about the configs.
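
A few rough sketches for the points above. These are illustrations only, not the actual code in the patch; the class and method names in them are made up.

For points 1 and 2, the idle-connection timeout I have in mind is just bookkeeping of last use per node plus a periodic sweep:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Hypothetical helper: track last-use time per node and report connections
    // idle longer than a configurable threshold (e.g. 30 seconds) so they can be closed.
    public class IdleConnectionReaper {
        private final long maxIdleMs;
        private final Map<Integer, Long> lastUsed = new HashMap<Integer, Long>(); // nodeId -> last use time

        public IdleConnectionReaper(long maxIdleMs) {
            this.maxIdleMs = maxIdleMs;
        }

        public void recordUse(int nodeId, long nowMs) {
            lastUsed.put(nodeId, nowMs);
        }

        // Nodes whose connections have been idle too long and should be disconnected.
        public List<Integer> idleNodes(long nowMs) {
            List<Integer> expired = new ArrayList<Integer>();
            Iterator<Map.Entry<Integer, Long>> it = lastUsed.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Integer, Long> entry = it.next();
                if (nowMs - entry.getValue() > maxIdleMs) {
                    expired.add(entry.getKey());
                    it.remove();
                }
            }
            return expired;
        }
    }

This handles the metadata-connection and VIP bootstrap cases uniformly, which is the point of 2.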
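
For point 3, seeding the round-robin counter randomly instead of at 0 (again, names are illustrative):

    import java.util.Random;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RoundRobinExample {
        // Start at a random value so producers launched at the same instant
        // don't all walk the partitions in lockstep starting from 0.
        private final AtomicInteger counter = new AtomicInteger(new Random().nextInt());

        public int partition(int numPartitions) {
            // Mask the sign bit rather than using Math.abs, which breaks on Integer.MIN_VALUE.
            return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
        }
    }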
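
For point 4, the shape I mean is that the partition count is handed to the partitioner directly, so the common case never has to touch the cluster metadata (hypothetical signature, not the exact one in the patch):

    import java.util.Arrays;

    public class ExampleHashPartitioner {
        // The real interface also receives the cluster for partitioners that need it,
        // but a typical implementation only needs the count:
        public int partition(byte[] key, int numPartitions) {
            return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
        }
    }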
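
For point 6, this is the usual lazily cached hash code pattern, where 0 doubles as the "not computed yet" sentinel (the same trade-off java.lang.String makes); a key that genuinely hashes to 0 just gets recomputed on each call:

    import java.util.Arrays;

    public class CachedHashExample {
        private final byte[] key;
        private int hash; // 0 means "not computed", or a real hash that happens to be 0

        public CachedHashExample(byte[] key) {
            this.key = key;
        }

        @Override
        public int hashCode() { // equals() omitted for brevity
            int h = hash;
            if (h == 0) {
                h = Arrays.hashCode(key);
                hash = h; // if h is actually 0 we simply recompute next time
            }
            return h;
        }
    }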
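
For point 9, the ordering constraint, with the real RecordSend and callback replaced by simplified stand-ins:

    import java.util.concurrent.CountDownLatch;

    // The callback reads results through the same send object, and those reads
    // block until the send is completed -- hence complete first, then call back.
    class ExampleSend {
        private final CountDownLatch completed = new CountDownLatch(1);
        private volatile long offset = -1L;

        void complete(long offset) {
            this.offset = offset;
            completed.countDown();
        }

        long offset() throws InterruptedException {
            completed.await(); // a callback fired before complete() would block here
            return offset;
        }
    }

    interface ExampleCallback {
        void onCompletion(ExampleSend send);
    }

    class ExampleCompleter {
        static void finish(ExampleSend send, ExampleCallback callback, long offset) {
            send.complete(offset);       // 1) complete the send first
            callback.onCompletion(send); // 2) then fire the callback; offset() won't block
        }
    }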
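
For point 13, the two reporting options side by side (names made up):

    public class ExampleSensor {
        private double last;  // instantaneous: whatever was recorded most recently
        private double sum;   // windowed: running sum since the window started
        private long count;

        public synchronized void record(double value) {
            last = value;
            sum += value;
            count++;
        }

        public synchronized double instantaneous() {
            return last;
        }

        public synchronized double windowedAverage() {
            return count == 0 ? Double.NaN : sum / count;
        }

        public synchronized void resetWindow() {
            sum = 0.0;
            count = 0;
        }
    }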
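
For point 14, using java.util.logging in the few places that need it would look like this, with no extra dependency (the class here is hypothetical):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class MetadataFetchExample {
        private static final Logger log = Logger.getLogger(MetadataFetchExample.class.getName());

        public void onMetadataRefresh(String topic, long elapsedMs) {
            // java.util.logging ships with the JVM, so the client pulls in no logging jar.
            log.log(Level.FINE, "Refreshed metadata for topic {0} in {1} ms",
                    new Object[] { topic, elapsedMs });
        }
    }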



> Code dump of new producer
> -------------------------
>
>                 Key: KAFKA-1227
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1227
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>         Attachments: KAFKA-1227.patch
>
>
> The plan is to take a dump of the producer code "as is" and then do a series of post-commit reviews to get it into shape. This bug tracks just the code dump.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)