You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Gwen Shapira <gs...@cloudera.com> on 2015/07/10 20:34:46 UTC

[Discussion] Limitations on topic names

Hi Kafka Fans,

If you have one topic named "kafka_lab_2" and the other named
"kafka.lab.2", the topic level metrics will be named kafka_lab_2 for
both, effectively making it impossible to monitor them properly.

The reason this happens is that using "." in topic names is pretty
common, especially as a way to group topics into data centers,
relevant apps, etc - basically a work-around to our current lack of
name spaces. However, most metric monitoring systems using "." to
annotate hierarchy, so to avoid issues around metric names, Kafka
replaces the "." in the name with an underscore.

This generates good metric names, but creates the problem with name collisions.

I'm wondering if it makes sense to simply limit the range of
characters permitted in a topic name and disallow "_"? Obviously
existing topics will need to remain as is, which is a bit awkward.

If anyone has better backward-compatible solutions to this, I'm all ears :)

Gwen

Re: [Discussion] Limitations on topic names

Posted by Grant Henke <gh...@cloudera.com>.
I vote for #1 too.

A special reason Kafka may use '.' in the future is for hierarchical or
namespaced topics.

On Fri, Jul 10, 2015 at 3:32 PM, Todd Palino <tp...@gmail.com> wrote:

> My selfish point of view is that we do #1, as we use "_" extensively in
> topic names here :) I also happen to think it's the right choice,
> specifically because "." has more special meanings, as you noted.
>
> -Todd
>
>
> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > Unintentional side effect from allowing IP addresses in consumer client
> > IDs :)
> >
> > So the question is, what do we do now?
> >
> > 1) disallow "."
> > 2) disallow "_"
> > 3) find a reversible way to encode "." and "_" that won't break existing
> > metrics
> > 4) all of the above?
> >
> > btw. it looks like "." and ".." are currently valid. Topic names are
> > used for directories, right? this sounds like fun :)
> >
> > I vote for option #1, although if someone has a good idea for #3 it
> > will be even better.
> >
> > Gwen
> >
> >
> >
> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> wrote:
> > > Found it was added here:
> https://issues.apache.org/jira/browse/KAFKA-697
> > >
> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >
> > >> This was definitely changed at some point after KAFKA-495. The
> question
> > is
> > >> when and why.
> > >>
> > >> Here's the relevant code from that patch:
> > >>
> > >> ===================================================================
> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> > >> @@ -21,24 +21,21 @@
> > >>  import util.matching.Regex
> > >>
> > >>  object Topic {
> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >>
> > >>
> > >>
> > >> -Todd
> > >>
> > >>
> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> > wrote:
> > >>
> > >> > kafka.common.Topic shows that currently period is a valid character
> > and I
> > >> > have verified I can use kafka-topics.sh to create a new topic with a
> > >> > period.
> > >> >
> > >> >
> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently
> > uses
> > >> > Topic.validate before writing to Zookeeper.
> > >> >
> > >> > Should period character support be removed? I was under the same
> > >> impression
> > >> > as Gwen, that a period was used by many as a way to "group" topics.
> > >> >
> > >> > The code is pasted below since its small:
> > >> >
> > >> > object Topic {
> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> >   private val maxNameLength = 255
> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> >
> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> > >> >
> > >> >   def validate(topic: String) {
> > >> >     if (topic.length <= 0)
> > >> >       throw new InvalidTopicException("topic name is illegal, can't
> be
> > >> > empty")
> > >> >     else if (topic.equals(".") || topic.equals(".."))
> > >> >       throw new InvalidTopicException("topic name cannot be \".\" or
> > >> > \"..\"")
> > >> >     else if (topic.length > maxNameLength)
> > >> >       throw new InvalidTopicException("topic name is illegal, can't
> be
> > >> > longer than " + maxNameLength + " characters")
> > >> >
> > >> >     rgx.findFirstIn(topic) match {
> > >> >       case Some(t) =>
> > >> >         if (!t.equals(topic))
> > >> >           throw new InvalidTopicException("topic name " + topic + "
> is
> > >> > illegal, contains a character other than ASCII alphanumerics, '.',
> '_'
> > >> and
> > >> > '-'")
> > >> >       case None => throw new InvalidTopicException("topic name " +
> > topic
> > >> +
> > >> > " is illegal,  contains a character other than ASCII alphanumerics,
> > '.',
> > >> > '_' and '-'")
> > >> >     }
> > >> >   }
> > >> > }
> > >> >
> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> > wrote:
> > >> >
> > >> > > I had to go look this one up again to make sure -
> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> > >
> > >> > > The only valid character names for topics are alphanumeric,
> > underscore,
> > >> > and
> > >> > > dash. A period is not supposed to be a valid character to use. If
> > >> you're
> > >> > > seeing them, then one of two things have happened:
> > >> > >
> > >> > > 1) You have topic names that are grandfathered in from before that
> > >> patch
> > >> > > 2) The patch is not working properly and there is somewhere in the
> > >> broker
> > >> > > that the standard is not being enforced.
> > >> > >
> > >> > > -Todd
> > >> > >
> > >> > >
> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
> > >> wrote:
> > >> > >
> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> gshapira@cloudera.com>
> > >> > > > wrote:
> > >> > > > > Hi Kafka Fans,
> > >> > > > >
> > >> > > > > If you have one topic named "kafka_lab_2" and the other named
> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> kafka_lab_2
> > >> for
> > >> > > > > both, effectively making it impossible to monitor them
> properly.
> > >> > > > >
> > >> > > > > The reason this happens is that using "." in topic names is
> > pretty
> > >> > > > > common, especially as a way to group topics into data centers,
> > >> > > > > relevant apps, etc - basically a work-around to our current
> > lack of
> > >> > > > > name spaces. However, most metric monitoring systems using "."
> > to
> > >> > > > > annotate hierarchy, so to avoid issues around metric names,
> > Kafka
> > >> > > > > replaces the "." in the name with an underscore.
> > >> > > > >
> > >> > > > > This generates good metric names, but creates the problem with
> > name
> > >> > > > collisions.
> > >> > > > >
> > >> > > > > I'm wondering if it makes sense to simply limit the range of
> > >> > > > > characters permitted in a topic name and disallow "_"?
> Obviously
> > >> > > > > existing topics will need to remain as is, which is a bit
> > awkward.
> > >> > > >
> > >> > > > Interesting problem! Many if not most users I personally am
> aware
> > of
> > >> > > > use "_" as a separator in topic names. I am sure that many users
> > >> would
> > >> > > > be quite surprised by this limitation. With that said, I am sure
> > >> > > > they'd transition accordingly.
> > >> > > >
> > >> > > > >
> > >> > > > > If anyone has better backward-compatible solutions to this,
> I'm
> > all
> > >> > > ears
> > >> > > > :)
> > >> > > > >
> > >> > > > > Gwen
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Grant Henke
> > >> > Solutions Consultant | Cloudera
> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Grant Henke
> > > Solutions Consultant | Cloudera
> > > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>



-- 
Grant Henke
Solutions Consultant | Cloudera
ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [Discussion] Limitations on topic names

Posted by Joel Koshy <jj...@gmail.com>.
This did come up in the discussion in KAFKA-1902. It is somewhat
concerning that something very specific - in this case (what I think
is a limitation [1]) in certain metric reporters should drive the
decision on what constitutes a legal topic name in Kafka - especially
when all the characters in question actually seem reasonable in a
topic name.

I'm guessing this is not a popular choice simply because these metric
systems are actually popular, but my preference would be to do nothing
here and these users should just avoid such characters in topics.

[1] https://issues.apache.org/jira/browse/KAFKA-1902?focusedCommentId=14294733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14294733

On Mon, Jul 13, 2015 at 07:40:17AM -0700, Jun Rao wrote:
> Magnus,
> 
> Converting dot to _ essentially is our way of escaping in the scope part of
> the metric name. The issue is that your options of escaping is limited due
> to the constraints in the reporters. For example, the Ganglia reporter
> replaces anything other than alpha-numeric, -, _ and dot to _ in the metric
> name. Not sure how well Graphite deals with \ either. For details, take a
> look at the discussion in KAFKA-1902. Note that the replacement of dots
> only affects the reporters. Dots are preserved in the mbean names.
> 
> Thanks,
> 
> Jun
> 
> On Sun, Jul 12, 2015 at 10:58 PM, Magnus Edenhill <ma...@edenhill.se>
> wrote:
> 
> > Hi,
> >
> > since dots seem to be a problem on the metrics side, why not let the
> > metrics side handle it
> > by escaping troublesome characters? E.g. "foo.my\.topic.feh"
> > Let's not push the problem upstream.
> >
> > Replacing "." with another set of allowed characters "__" seems like a bad
> > idea since it
> > is ambigious: "__consumer_offsets" == ".consumer_offsets"?
> >
> > I'm guessing the same problem arises if broker names are part of the
> > metrics name,
> > e.g., "broker.192.168.0.2.rxbytes", do we want to push the exclusion of
> > dots in IP addresses
> > upstream as well? :)
> >
> > Magnus
> >
> >
> > 2015-07-13 2:06 GMT+02:00 Jun Rao <ju...@confluent.io>:
> >
> > > First, a couple of clarifications on this.
> > >
> > > 1. Currently, we allow Kafka topic to have dots, except that we disallow
> > > topic names that are exactly "." or ".." (which can cause weird problems
> > > when mapping to file directories and ZK paths as Gwen pointed out).
> > >
> > > 2. When creating the Coda Hale metrics, currently, we only replace dot
> > with
> > > _ in the scope of the metric name. The actually jmx bean name still
> > > preserves dot. This is because the Graphite reporter uses scope when
> > > forming the metric names and assumes dots are component separators (see
> > > KAFKA-1902 for details). So, if one uses tools like jmxtrans to export
> > the
> > > metrics from the mbeans directly, the original topic name is preserved.
> > > However, I am not sure how well this maps to Graphite. We thought about
> > > making the replacing character configurable. However, the difficulty is
> > > that the logic of doing the replacement is in a singleton
> > > class KafkaMetricsGroup and I am not sure if we can pass in an external
> > > config.
> > >
> > > Given the above, I'd suggest that customer try the jmxtrans to Graphite
> > > path and see if that helps. I agree that it's too disruptive to restrict
> > > the current topic naming convention.
> > >
> > > Also, since we plan to replace Coda Hale metrics with Kafka metrics in
> > the
> > > future, we can try to address this issue better then.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > >
> > > On Sun, Jul 12, 2015 at 10:26 AM, Gwen Shapira <gs...@cloudera.com>
> > > wrote:
> > >
> > > > I like the "lets warn people of conflicts when creating the topic"
> > > > suggestion. IMO, automatic topic creation as currently done is buggy
> > > > either way (Send data and hope the topic is ready before retries run
> > > > out, potentially failing with the super helpful NO_LEADER error), so I
> > > > don't mind leaving it broken a bit more. I think the right behavior is
> > > > that conflicts will cause auto creating to fail, the same way we
> > > > currently do when the default number of replicas is higher than number
> > > > of brokers.
> > > >
> > > > One thing that is left confusing is that people in the "." camp need
> > > > to know about the conversion or they will fail to find their topics in
> > > > their monitoring tools. Not very nice to them, but I can't think of
> > > > alternatives.
> > > >
> > > > I'll start with the doc patch :)
> > > >
> > > > On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> > > > <ew...@confluent.io> wrote:
> > > > > On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gshapira@cloudera.com
> > >
> > > > wrote:
> > > > >
> > > > >> Yeah, I have an actual customer who ran into this. Unfortunately,
> > > > >> inconsistencies in the way things are named are pretty common - just
> > > > >> look at Kafka's many CLI options.
> > > > >>
> > > > >> I don't think that supporting both and pointing at the docs with "I
> > > > >> told you so" when our metrics break is a good solution.
> > > > >>
> > > > >
> > > > > I agree, especially since we don't *already* have something in the
> > docs
> > > > > indicating this will be an issue. I was flippant about the situation
> > > > > because I *wish* there was more careful consideration + naming policy
> > > in
> > > > > place, but I realize that doesn't always happen in practice. I guess
> > I
> > > > need
> > > > > to take Compatibility Czar more seriously :)
> > > > >
> > > > > I see think the obvious practical options are as follows:
> > > > >
> > > > > 1. Kill support for "_". Piss off the entire set of people who
> > > currently
> > > > > use "_" anywhere in topic names.
> > > > > 2. Kill support for ".". Piss off the entire set of people who
> > > currently
> > > > > use "." anywhere in topic names.
> > > > > 3. Tell people they need to be careful about this issue. Piss off the
> > > set
> > > > > of people who use both "_" and "." *and* happen to have conflicting
> > > topic
> > > > > names. They will have some pain when they discover the issue and have
> > > to
> > > > > figure out how to move one of those topics over to a non-conflicting
> > > > name.
> > > > > I'm going to claim that this group must be an *extremely* small
> > > fraction
> > > > of
> > > > > users, which doesn't make it better to allow things to break for
> > them,
> > > > but
> > > > > at least gives us an idea of the scale of impact.
> > > > >
> > > > > (One other alternative suggested earlier was encoding metric names to
> > > > > account for differences; given the metric renaming mess in the last
> > > > > release, I'm extremely hesitant to suggest anything of the sort...)
> > > > >
> > > > > None of the options are ideal, but to me, 3 seems like the least
> > > painful.
> > > > > Both for us, and for the vast majority of users. It seems to me that
> > > the
> > > > > number of users that would complain about (1) or (2) drastically
> > > outweigh
> > > > > (3).
> > > > >
> > > > > At this point, I don't think it's practical to keep switching the
> > rules
> > > > > about which characters are allowed and which aren't because the
> > > previous
> > > > > attempts haven't been successful -- it seems the rules have changed
> > > > > multiple times, whether intentionally or accidentally, such that any
> > > more
> > > > > changes will cause problems. At this point, I think we just need to
> > > > accept
> > > > > being liberal in accepting the range of topic names that have been
> > > > > permitted so far and make the best of the situation, even if it means
> > > > only
> > > > > being able to warn people of conflicts.
> > > > >
> > > > > Here's another alternative: how about being liberal with topic name
> > > > > characters, but upon topic creation we convert the name to the metric
> > > > name
> > > > > and fail if there's a conflict with another topic? This is relatively
> > > > > expensive (requires getting the metric name of all other topics), but
> > > it
> > > > > avoids the bad situation we're encountering here (conflicting
> > metrics),
> > > > > avoids getting into a persistent conflict (we kill topic creation
> > when
> > > we
> > > > > detect the issue rather than noticing it when the metrics conflict
> > > > > happens), and keeps the vast majority of existing users happy (both _
> > > > and .
> > > > > work in topic names as long as you don't create topics with
> > conflicting
> > > > > metric names).
> > > > >
> > > > > There are definitely details to be worked out (auto topic creation?),
> > > but
> > > > > it seems like a more realistic solution than to start disallowing _
> > or
> > > .
> > > > in
> > > > > topic names.
> > > > >
> > > > > -Ewen
> > > > >
> > > > >
> > > > >>
> > > > >> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> > > > >> <ew...@confluent.io> wrote:
> > > > >> > I figure you'll probably see complaints no matter what change you
> > > > make.
> > > > >> > Gwen, given that you raised this, another important question might
> > > be
> > > > how
> > > > >> > many people you see using *both*. I'm guessing this question came
> > up
> > > > >> > because you actually saw a conflict? But I'd imagine (or at least
> > > > hope)
> > > > >> > that most organizations are mostly consistent about naming topics
> > --
> > > > they
> > > > >> > standardize on one or the other.
> > > > >> >
> > > > >> > Since there's no "right" way to name them, I'd just leave it
> > > > supporting
> > > > >> > both and document the potential conflict in metrics. And if people
> > > use
> > > > >> both
> > > > >> > naming schemes, they probably deserve to suffer for their
> > > > inconsistency
> > > > >> :)
> > > > >> >
> > > > >> > -Ewen
> > > > >> >
> > > > >> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <
> > > gshapira@cloudera.com>
> > > > >> wrote:
> > > > >> >
> > > > >> >> I find dots more common in my customer base, so I will definitely
> > > > feel
> > > > >> >> the pain of removing them.
> > > > >> >>
> > > > >> >> However, "." are already used in metrics, file names,
> > directories,
> > > > etc
> > > > >> >> - so if we keep the dots, we need to keep code that translates
> > them
> > > > >> >> and document the translation. Just banning "." seems more
> > natural.
> > > > >> >> Also, as Grant mentioned, we'll probably have our own special
> > usage
> > > > >> >> for "." down the line.
> > > > >> >>
> > > > >> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> > > > wrote:
> > > > >> >> > I absolutely disagree with #2, Neha. That will break a lot of
> > > > >> >> > infrastructure within LinkedIn. That said, removing "." might
> > > break
> > > > >> other
> > > > >> >> > people as well, but I think we should have a clearer idea of
> > how
> > > > much
> > > > >> >> usage
> > > > >> >> > there is on either side.
> > > > >> >> >
> > > > >> >> > -Todd
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <
> > > neha@confluent.io>
> > > > >> >> wrote:
> > > > >> >> >
> > > > >> >> >> "." seems natural for grouping topic names. +1 for 2) going
> > > > forward
> > > > >> only
> > > > >> >> >> without breaking previously created topics with "_" though
> > that
> > > > might
> > > > >> >> >> require us to patch the code somewhat awkwardly till we phase
> > it
> > > > out
> > > > >> a
> > > > >> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> > > > >> versions
> > > > >> >> >> later.
> > > > >> >> >>
> > > > >> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> > > > gshapira@cloudera.com
> > > > >> >
> > > > >> >> >> wrote:
> > > > >> >> >>
> > > > >> >> >> > I don't think we should break existing topics. Just disallow
> > > new
> > > > >> >> >> > topics going forward.
> > > > >> >> >> >
> > > > >> >> >> > Agree that having both is horrible, but we should have a
> > > > solution
> > > > >> that
> > > > >> >> >> > fails when you run "kafka_topics.sh --create", not when you
> > > > >> configure
> > > > >> >> >> > Ganglia.
> > > > >> >> >> >
> > > > >> >> >> > Gwen
> > > > >> >> >> >
> > > > >> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <
> > jay@confluent.io>
> > > > >> wrote:
> > > > >> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> > > > >> perverse,
> > > > >> >> >> but
> > > > >> >> >> > > people seem to do it. Breaking all the topics with '.' in
> > > the
> > > > >> name
> > > > >> >> >> seems
> > > > >> >> >> > > like it could be worse than combining metrics for people
> > who
> > > > >> have a
> > > > >> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is
> > > DEEPLY
> > > > >> >> perverse,
> > > > >> >> >> > > no?).
> > > > >> >> >> > >
> > > > >> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> > > > >> >> >> > >
> > > > >> >> >> > > -Jay
> > > > >> >> >> > >
> > > > >> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <
> > > > tpalino@gmail.com>
> > > > >> >> >> wrote:
> > > > >> >> >> > >
> > > > >> >> >> > >> My selfish point of view is that we do #1, as we use "_"
> > > > >> >> extensively
> > > > >> >> >> in
> > > > >> >> >> > >> topic names here :) I also happen to think it's the right
> > > > >> choice,
> > > > >> >> >> > >> specifically because "." has more special meanings, as
> > you
> > > > >> noted.
> > > > >> >> >> > >>
> > > > >> >> >> > >> -Todd
> > > > >> >> >> > >>
> > > > >> >> >> > >>
> > > > >> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> > > > >> >> gshapira@cloudera.com>
> > > > >> >> >> > >> wrote:
> > > > >> >> >> > >>
> > > > >> >> >> > >> > Unintentional side effect from allowing IP addresses in
> > > > >> consumer
> > > > >> >> >> > client
> > > > >> >> >> > >> > IDs :)
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > So the question is, what do we do now?
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > 1) disallow "."
> > > > >> >> >> > >> > 2) disallow "_"
> > > > >> >> >> > >> > 3) find a reversible way to encode "." and "_" that
> > won't
> > > > >> break
> > > > >> >> >> > existing
> > > > >> >> >> > >> > metrics
> > > > >> >> >> > >> > 4) all of the above?
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > btw. it looks like "." and ".." are currently valid.
> > > Topic
> > > > >> names
> > > > >> >> are
> > > > >> >> >> > >> > used for directories, right? this sounds like fun :)
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > I vote for option #1, although if someone has a good
> > idea
> > > > for
> > > > >> #3
> > > > >> >> it
> > > > >> >> >> > >> > will be even better.
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > Gwen
> > > > >> >> >> > >> >
> > > > >> >> >> > >> >
> > > > >> >> >> > >> >
> > > > >> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> > > > >> >> ghenke@cloudera.com>
> > > > >> >> >> > >> wrote:
> > > > >> >> >> > >> > > Found it was added here:
> > > > >> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > > > >> >> >> > >> > >
> > > > >> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> > > > >> >> tpalino@gmail.com>
> > > > >> >> >> > >> wrote:
> > > > >> >> >> > >> > >
> > > > >> >> >> > >> > >> This was definitely changed at some point after
> > > > KAFKA-495.
> > > > >> The
> > > > >> >> >> > >> question
> > > > >> >> >> > >> > is
> > > > >> >> >> > >> > >> when and why.
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >> Here's the relevant code from that patch:
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >>
> > > > >> >> >>
> > > > ===================================================================
> > > > >> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala
> > > > (revision
> > > > >> >> >> 1390178)
> > > > >> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala
> > > (working
> > > > >> copy)
> > > > >> >> >> > >> > >> @@ -21,24 +21,21 @@
> > > > >> >> >> > >> > >>  import util.matching.Regex
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >>  object Topic {
> > > > >> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >> -Todd
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> > > > >> >> >> ghenke@cloudera.com>
> > > > >> >> >> > >> > wrote:
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >> > kafka.common.Topic shows that currently period is
> > a
> > > > valid
> > > > >> >> >> > character
> > > > >> >> >> > >> > and I
> > > > >> >> >> > >> > >> > have verified I can use kafka-topics.sh to create
> > a
> > > > new
> > > > >> >> topic
> > > > >> >> >> > with a
> > > > >> >> >> > >> > >> > period.
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >
> > > > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > > > >> >> >> > currently
> > > > >> >> >> > >> > uses
> > > > >> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > Should period character support be removed? I was
> > > > under
> > > > >> the
> > > > >> >> >> same
> > > > >> >> >> > >> > >> impression
> > > > >> >> >> > >> > >> > as Gwen, that a period was used by many as a way
> > to
> > > > >> "group"
> > > > >> >> >> > topics.
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > The code is pasted below since its small:
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > object Topic {
> > > > >> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > > > >> >> >> > >> > >> >   private val maxNameLength = 255
> > > > >> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >   val InternalTopics =
> > > > >> Set(OffsetManager.OffsetsTopicName)
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >   def validate(topic: String) {
> > > > >> >> >> > >> > >> >     if (topic.length <= 0)
> > > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> > is
> > > > >> >> illegal,
> > > > >> >> >> > can't
> > > > >> >> >> > >> be
> > > > >> >> >> > >> > >> > empty")
> > > > >> >> >> > >> > >> >     else if (topic.equals(".") ||
> > > topic.equals(".."))
> > > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> > > > cannot
> > > > >> be
> > > > >> >> >> > \".\" or
> > > > >> >> >> > >> > >> > \"..\"")
> > > > >> >> >> > >> > >> >     else if (topic.length > maxNameLength)
> > > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> > is
> > > > >> >> illegal,
> > > > >> >> >> > can't
> > > > >> >> >> > >> be
> > > > >> >> >> > >> > >> > longer than " + maxNameLength + " characters")
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> > > > >> >> >> > >> > >> >       case Some(t) =>
> > > > >> >> >> > >> > >> >         if (!t.equals(topic))
> > > > >> >> >> > >> > >> >           throw new InvalidTopicException("topic
> > > name
> > > > " +
> > > > >> >> topic
> > > > >> >> >> > + "
> > > > >> >> >> > >> is
> > > > >> >> >> > >> > >> > illegal, contains a character other than ASCII
> > > > >> >> alphanumerics,
> > > > >> >> >> > '.',
> > > > >> >> >> > >> '_'
> > > > >> >> >> > >> > >> and
> > > > >> >> >> > >> > >> > '-'")
> > > > >> >> >> > >> > >> >       case None => throw new
> > > > InvalidTopicException("topic
> > > > >> >> name
> > > > >> >> >> "
> > > > >> >> >> > +
> > > > >> >> >> > >> > topic
> > > > >> >> >> > >> > >> +
> > > > >> >> >> > >> > >> > " is illegal,  contains a character other than
> > ASCII
> > > > >> >> >> > alphanumerics,
> > > > >> >> >> > >> > '.',
> > > > >> >> >> > >> > >> > '_' and '-'")
> > > > >> >> >> > >> > >> >     }
> > > > >> >> >> > >> > >> >   }
> > > > >> >> >> > >> > >> > }
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> > > > >> >> >> tpalino@gmail.com>
> > > > >> >> >> > >> > wrote:
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > > I had to go look this one up again to make sure
> > -
> > > > >> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > > The only valid character names for topics are
> > > > >> >> alphanumeric,
> > > > >> >> >> > >> > underscore,
> > > > >> >> >> > >> > >> > and
> > > > >> >> >> > >> > >> > > dash. A period is not supposed to be a valid
> > > > character
> > > > >> to
> > > > >> >> >> use.
> > > > >> >> >> > If
> > > > >> >> >> > >> > >> you're
> > > > >> >> >> > >> > >> > > seeing them, then one of two things have
> > happened:
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > > 1) You have topic names that are grandfathered
> > in
> > > > from
> > > > >> >> before
> > > > >> >> >> > that
> > > > >> >> >> > >> > >> patch
> > > > >> >> >> > >> > >> > > 2) The patch is not working properly and there
> > is
> > > > >> >> somewhere
> > > > >> >> >> in
> > > > >> >> >> > the
> > > > >> >> >> > >> > >> broker
> > > > >> >> >> > >> > >> > > that the standard is not being enforced.
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > > -Todd
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > > > >> >> >> > brock@apache.org>
> > > > >> >> >> > >> > >> wrote:
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen
> > Shapira <
> > > > >> >> >> > >> > >> gshapira@cloudera.com>
> > > > >> >> >> > >> > >> > > > wrote:
> > > > >> >> >> > >> > >> > > > > Hi Kafka Fans,
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2"
> > and
> > > > the
> > > > >> >> other
> > > > >> >> >> > named
> > > > >> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will
> > be
> > > > >> named
> > > > >> >> >> > >> kafka_lab_2
> > > > >> >> >> > >> > >> for
> > > > >> >> >> > >> > >> > > > > both, effectively making it impossible to
> > > > monitor
> > > > >> them
> > > > >> >> >> > >> properly.
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > The reason this happens is that using "." in
> > > > topic
> > > > >> >> names
> > > > >> >> >> is
> > > > >> >> >> > >> > pretty
> > > > >> >> >> > >> > >> > > > > common, especially as a way to group topics
> > > into
> > > > >> data
> > > > >> >> >> > centers,
> > > > >> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around
> > > to
> > > > our
> > > > >> >> >> current
> > > > >> >> >> > >> > lack of
> > > > >> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> > > > >> systems
> > > > >> >> >> using
> > > > >> >> >> > "."
> > > > >> >> >> > >> > to
> > > > >> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues
> > around
> > > > >> metric
> > > > >> >> >> names,
> > > > >> >> >> > >> > Kafka
> > > > >> >> >> > >> > >> > > > > replaces the "." in the name with an
> > > underscore.
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > This generates good metric names, but
> > creates
> > > > the
> > > > >> >> problem
> > > > >> >> >> > with
> > > > >> >> >> > >> > name
> > > > >> >> >> > >> > >> > > > collisions.
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply
> > > limit
> > > > the
> > > > >> >> range
> > > > >> >> >> > of
> > > > >> >> >> > >> > >> > > > > characters permitted in a topic name and
> > > > disallow
> > > > >> "_"?
> > > > >> >> >> > >> Obviously
> > > > >> >> >> > >> > >> > > > > existing topics will need to remain as is,
> > > which
> > > > >> is a
> > > > >> >> bit
> > > > >> >> >> > >> > awkward.
> > > > >> >> >> > >> > >> > > >
> > > > >> >> >> > >> > >> > > > Interesting problem! Many if not most users I
> > > > >> >> personally am
> > > > >> >> >> > >> aware
> > > > >> >> >> > >> > of
> > > > >> >> >> > >> > >> > > > use "_" as a separator in topic names. I am
> > sure
> > > > that
> > > > >> >> many
> > > > >> >> >> > users
> > > > >> >> >> > >> > >> would
> > > > >> >> >> > >> > >> > > > be quite surprised by this limitation. With
> > that
> > > > >> said,
> > > > >> >> I am
> > > > >> >> >> > sure
> > > > >> >> >> > >> > >> > > > they'd transition accordingly.
> > > > >> >> >> > >> > >> > > >
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > If anyone has better backward-compatible
> > > > solutions
> > > > >> to
> > > > >> >> >> this,
> > > > >> >> >> > >> I'm
> > > > >> >> >> > >> > all
> > > > >> >> >> > >> > >> > > ears
> > > > >> >> >> > >> > >> > > > :)
> > > > >> >> >> > >> > >> > > > >
> > > > >> >> >> > >> > >> > > > > Gwen
> > > > >> >> >> > >> > >> > > >
> > > > >> >> >> > >> > >> > >
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >> > --
> > > > >> >> >> > >> > >> > Grant Henke
> > > > >> >> >> > >> > >> > Solutions Consultant | Cloudera
> > > > >> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > > > >> >> >> > >> > linkedin.com/in/granthenke
> > > > >> >> >> > >> > >> >
> > > > >> >> >> > >> > >>
> > > > >> >> >> > >> > >
> > > > >> >> >> > >> > >
> > > > >> >> >> > >> > >
> > > > >> >> >> > >> > > --
> > > > >> >> >> > >> > > Grant Henke
> > > > >> >> >> > >> > > Solutions Consultant | Cloudera
> > > > >> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > > > >> >> >> > linkedin.com/in/granthenke
> > > > >> >> >> > >> >
> > > > >> >> >> > >>
> > > > >> >> >> >
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> --
> > > > >> >> >> Thanks,
> > > > >> >> >> Neha
> > > > >> >> >>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Thanks,
> > > > >> > Ewen
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Ewen
> > > >
> > >

Re: [Discussion] Limitations on topic names

Posted by Grant Henke <gh...@cloudera.com>.
Noting here that the period '.' also causes potentially confusing behavior
when using regex whitelists or blacklists. It can be easily worked around
but users need to be aware of escaping the period.

If I create two topics 'a.c' and 'abc' and start the following consumer,
both topics will be consumed:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --whitelist a.c

 To fix this, any period needs to be escaped:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --whitelist a\\.c


On Mon, Jul 13, 2015 at 1:41 PM, Joel Koshy <jj...@gmail.com> wrote:

> One way to get around this conflict could be to replace . with _ and _
> with __
>
> On Sat, Jul 11, 2015 at 10:33 AM, Todd Palino <tp...@gmail.com> wrote:
> > I tend to agree with this as a compromise at this point. The reality is
> that this is technical debt that has built up in the project, and it does
> not go away by documenting it, and it will only get worse.
> >
> > As pointed out, eliminating either character at this point is going to
> cause problems for someone. And unfortunately, Guozhang, converting to __
> doesn't really solve the problem either because that is still a valid topic
> name that could collide. It's less likely, but all it does is move the debt
> around a little.
> >
> > -Todd
> >
> >> On Jul 11, 2015, at 10:16 AM, Brock Noland <br...@apache.org> wrote:
> >>
> >> On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> >> <ew...@confluent.io> wrote:
> >>> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
> >>>
> >>>> Yeah, I have an actual customer who ran into this. Unfortunately,
> >>>> inconsistencies in the way things are named are pretty common - just
> >>>> look at Kafka's many CLI options.
> >>>>
> >>>> I don't think that supporting both and pointing at the docs with "I
> >>>> told you so" when our metrics break is a good solution.
> >>>
> >>> I agree, especially since we don't *already* have something in the docs
> >>> indicating this will be an issue. I was flippant about the situation
> >>> because I *wish* there was more careful consideration + naming policy
> in
> >>> place, but I realize that doesn't always happen in practice. I guess I
> need
> >>> to take Compatibility Czar more seriously :)
> >>>
> >>> I see think the obvious practical options are as follows:
> >>>
> >>> 1. Kill support for "_". Piss off the entire set of people who
> currently
> >>> use "_" anywhere in topic names.
> >>> 2. Kill support for ".". Piss off the entire set of people who
> currently
> >>> use "." anywhere in topic names.
> >>> 3. Tell people they need to be careful about this issue. Piss off the
> set
> >>> of people who use both "_" and "." *and* happen to have conflicting
> topic
> >>> names. They will have some pain when they discover the issue and have
> to
> >>> figure out how to move one of those topics over to a non-conflicting
> name.
> >>> I'm going to claim that this group must be an *extremely* small
> fraction of
> >>> users, which doesn't make it better to allow things to break for them,
> but
> >>> at least gives us an idea of the scale of impact.
> >>>
> >>> (One other alternative suggested earlier was encoding metric names to
> >>> account for differences; given the metric renaming mess in the last
> >>> release, I'm extremely hesitant to suggest anything of the sort...)
> >>>
> >>> None of the options are ideal, but to me, 3 seems like the least
> painful.
> >>> Both for us, and for the vast majority of users. It seems to me that
> the
> >>> number of users that would complain about (1) or (2) drastically
> outweigh
> >>> (3).
> >>>
> >>> At this point, I don't think it's practical to keep switching the rules
> >>> about which characters are allowed and which aren't because the
> previous
> >>> attempts haven't been successful -- it seems the rules have changed
> >>> multiple times, whether intentionally or accidentally, such that any
> more
> >>> changes will cause problems. At this point, I think we just need to
> accept
> >>> being liberal in accepting the range of topic names that have been
> >>> permitted so far and make the best of the situation, even if it means
> only
> >>> being able to warn people of conflicts.
> >>>
> >>> Here's another alternative: how about being liberal with topic name
> >>> characters, but upon topic creation we convert the name to the metric
> name
> >>> and fail if there's a conflict with another topic? This is relatively
> >>> expensive (requires getting the metric name of all other topics), but
> it
> >>> avoids the bad situation we're encountering here (conflicting metrics),
> >>> avoids getting into a persistent conflict (we kill topic creation when
> we
> >>> detect the issue rather than noticing it when the metrics conflict
> >>> happens), and keeps the vast majority of existing users happy (both _
> and .
> >>> work in topic names as long as you don't create topics with conflicting
> >>> metric names).
> >>>
> >>> There are definitely details to be worked out (auto topic creation?),
> but
> >>> it seems like a more realistic solution than to start disallowing _ or
> . in
> >>> topic names.
> >>
> >> I was thinking the same. Allow a.b or a_b but not a.b and a_b. This
> >> seems like it will impact a trivial amount of users and keep both the
> >> "." and "_" camps happy.
> >>
> >>>
> >>> -Ewen
> >>>
> >>>
> >>>>
> >>>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> >>>> <ew...@confluent.io> wrote:
> >>>>> I figure you'll probably see complaints no matter what change you
> make.
> >>>>> Gwen, given that you raised this, another important question might
> be how
> >>>>> many people you see using *both*. I'm guessing this question came up
> >>>>> because you actually saw a conflict? But I'd imagine (or at least
> hope)
> >>>>> that most organizations are mostly consistent about naming topics --
> they
> >>>>> standardize on one or the other.
> >>>>>
> >>>>> Since there's no "right" way to name them, I'd just leave it
> supporting
> >>>>> both and document the potential conflict in metrics. And if people
> use
> >>>> both
> >>>>> naming schemes, they probably deserve to suffer for their
> inconsistency
> >>>> :)
> >>>>>
> >>>>> -Ewen
> >>>>>
> >>>>>> On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <
> gshapira@cloudera.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I find dots more common in my customer base, so I will definitely
> feel
> >>>>>> the pain of removing them.
> >>>>>>
> >>>>>> However, "." are already used in metrics, file names, directories,
> etc
> >>>>>> - so if we keep the dots, we need to keep code that translates them
> >>>>>> and document the translation. Just banning "." seems more natural.
> >>>>>> Also, as Grant mentioned, we'll probably have our own special usage
> >>>>>> for "." down the line.
> >>>>>>
> >>>>>>> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> wrote:
> >>>>>>> I absolutely disagree with #2, Neha. That will break a lot of
> >>>>>>> infrastructure within LinkedIn. That said, removing "." might break
> >>>> other
> >>>>>>> people as well, but I think we should have a clearer idea of how
> much
> >>>>>> usage
> >>>>>>> there is on either side.
> >>>>>>>
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <neha@confluent.io
> >
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> "." seems natural for grouping topic names. +1 for 2) going
> forward
> >>>> only
> >>>>>>>> without breaking previously created topics with "_" though that
> might
> >>>>>>>> require us to patch the code somewhat awkwardly till we phase it
> out
> >>>> a
> >>>>>>>> couple (purposely left vague to stay out of Ewen's wrath :-))
> >>>> versions
> >>>>>>>> later.
> >>>>>>>>
> >>>>>>>> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> gshapira@cloudera.com
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I don't think we should break existing topics. Just disallow new
> >>>>>>>>> topics going forward.
> >>>>>>>>>
> >>>>>>>>> Agree that having both is horrible, but we should have a solution
> >>>> that
> >>>>>>>>> fails when you run "kafka_topics.sh --create", not when you
> >>>> configure
> >>>>>>>>> Ganglia.
> >>>>>>>>>
> >>>>>>>>> Gwen
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> >>>> wrote:
> >>>>>>>>>> Unfortunately '.' is pretty common too. I agree that it is
> >>>> perverse,
> >>>>>>>> but
> >>>>>>>>>> people seem to do it. Breaking all the topics with '.' in the
> >>>> name
> >>>>>>>> seems
> >>>>>>>>>> like it could be worse than combining metrics for people who
> >>>> have a
> >>>>>>>>>> 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> >>>>>> perverse,
> >>>>>>>>>> no?).
> >>>>>>>>>>
> >>>>>>>>>> Where is our Dean of Compatibility, Ewen, on this?
> >>>>>>>>>>
> >>>>>>>>>> -Jay
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tpalino@gmail.com
> >
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> My selfish point of view is that we do #1, as we use "_"
> >>>>>> extensively
> >>>>>>>> in
> >>>>>>>>>>> topic names here :) I also happen to think it's the right
> >>>> choice,
> >>>>>>>>>>> specifically because "." has more special meanings, as you
> >>>> noted.
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> >>>>>> gshapira@cloudera.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Unintentional side effect from allowing IP addresses in
> >>>> consumer
> >>>>>>>>> client
> >>>>>>>>>>>> IDs :)
> >>>>>>>>>>>>
> >>>>>>>>>>>> So the question is, what do we do now?
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) disallow "."
> >>>>>>>>>>>> 2) disallow "_"
> >>>>>>>>>>>> 3) find a reversible way to encode "." and "_" that won't
> >>>> break
> >>>>>>>>> existing
> >>>>>>>>>>>> metrics
> >>>>>>>>>>>> 4) all of the above?
> >>>>>>>>>>>>
> >>>>>>>>>>>> btw. it looks like "." and ".." are currently valid. Topic
> >>>> names
> >>>>>> are
> >>>>>>>>>>>> used for directories, right? this sounds like fun :)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I vote for option #1, although if someone has a good idea for
> >>>> #3
> >>>>>> it
> >>>>>>>>>>>> will be even better.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Gwen
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> >>>>>> ghenke@cloudera.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>> Found it was added here:
> >>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-697
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> >>>>>> tpalino@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> This was definitely changed at some point after KAFKA-495.
> >>>> The
> >>>>>>>>>>> question
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>> when and why.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Here's the relevant code from that patch:
> >>>>>>>>
> ===================================================================
> >>>>>>>>>>>>>> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> >>>>>>>> 1390178)
> >>>>>>>>>>>>>> +++ core/src/main/scala/kafka/utils/Topic.scala (working
> >>>> copy)
> >>>>>>>>>>>>>> @@ -21,24 +21,21 @@
> >>>>>>>>>>>>>> import util.matching.Regex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> object Topic {
> >>>>>>>>>>>>>> +  val legalChars = "[a-zA-Z0-9_-]"
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> >>>>>>>> ghenke@cloudera.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> kafka.common.Topic shows that currently period is a valid
> >>>>>>>>> character
> >>>>>>>>>>>> and I
> >>>>>>>>>>>>>>> have verified I can use kafka-topics.sh to create a new
> >>>>>> topic
> >>>>>>>>> with a
> >>>>>>>>>>>>>>> period.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> >>>>>>>>> currently
> >>>>>>>>>>>> uses
> >>>>>>>>>>>>>>> Topic.validate before writing to Zookeeper.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Should period character support be removed? I was under
> >>>> the
> >>>>>>>> same
> >>>>>>>>>>>>>> impression
> >>>>>>>>>>>>>>> as Gwen, that a period was used by many as a way to
> >>>> "group"
> >>>>>>>>> topics.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The code is pasted below since its small:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> object Topic {
> >>>>>>>>>>>>>>>  val legalChars = "[a-zA-Z0-9\\._\\-]"
> >>>>>>>>>>>>>>>  private val maxNameLength = 255
> >>>>>>>>>>>>>>>  private val rgx = new Regex(legalChars + "+")
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  val InternalTopics =
> >>>> Set(OffsetManager.OffsetsTopicName)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  def validate(topic: String) {
> >>>>>>>>>>>>>>>    if (topic.length <= 0)
> >>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
> >>>>>> illegal,
> >>>>>>>>> can't
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>> empty")
> >>>>>>>>>>>>>>>    else if (topic.equals(".") || topic.equals(".."))
> >>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name cannot
> >>>> be
> >>>>>>>>> \".\" or
> >>>>>>>>>>>>>>> \"..\"")
> >>>>>>>>>>>>>>>    else if (topic.length > maxNameLength)
> >>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
> >>>>>> illegal,
> >>>>>>>>> can't
> >>>>>>>>>>> be
> >>>>>>>>>>>>>>> longer than " + maxNameLength + " characters")
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>    rgx.findFirstIn(topic) match {
> >>>>>>>>>>>>>>>      case Some(t) =>
> >>>>>>>>>>>>>>>        if (!t.equals(topic))
> >>>>>>>>>>>>>>>          throw new InvalidTopicException("topic name " +
> >>>>>> topic
> >>>>>>>>> + "
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>> illegal, contains a character other than ASCII
> >>>>>> alphanumerics,
> >>>>>>>>> '.',
> >>>>>>>>>>> '_'
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> '-'")
> >>>>>>>>>>>>>>>      case None => throw new InvalidTopicException("topic
> >>>>>> name
> >>>>>>>> "
> >>>>>>>>> +
> >>>>>>>>>>>> topic
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> " is illegal,  contains a character other than ASCII
> >>>>>>>>> alphanumerics,
> >>>>>>>>>>>> '.',
> >>>>>>>>>>>>>>> '_' and '-'")
> >>>>>>>>>>>>>>>    }
> >>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> >>>>>>>> tpalino@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I had to go look this one up again to make sure -
> >>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-495
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The only valid character names for topics are
> >>>>>> alphanumeric,
> >>>>>>>>>>>> underscore,
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> dash. A period is not supposed to be a valid character
> >>>> to
> >>>>>>>> use.
> >>>>>>>>> If
> >>>>>>>>>>>>>> you're
> >>>>>>>>>>>>>>>> seeing them, then one of two things have happened:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1) You have topic names that are grandfathered in from
> >>>>>> before
> >>>>>>>>> that
> >>>>>>>>>>>>>> patch
> >>>>>>>>>>>>>>>> 2) The patch is not working properly and there is
> >>>>>> somewhere
> >>>>>>>> in
> >>>>>>>>> the
> >>>>>>>>>>>>>> broker
> >>>>>>>>>>>>>>>> that the standard is not being enforced.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> >>>>>>>>> brock@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >>>>>>>>>>>>>> gshapira@cloudera.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>> Hi Kafka Fans,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If you have one topic named "kafka_lab_2" and the
> >>>>>> other
> >>>>>>>>> named
> >>>>>>>>>>>>>>>>>> "kafka.lab.2", the topic level metrics will be
> >>>> named
> >>>>>>>>>>> kafka_lab_2
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> both, effectively making it impossible to monitor
> >>>> them
> >>>>>>>>>>> properly.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The reason this happens is that using "." in topic
> >>>>>> names
> >>>>>>>> is
> >>>>>>>>>>>> pretty
> >>>>>>>>>>>>>>>>>> common, especially as a way to group topics into
> >>>> data
> >>>>>>>>> centers,
> >>>>>>>>>>>>>>>>>> relevant apps, etc - basically a work-around to our
> >>>>>>>> current
> >>>>>>>>>>>> lack of
> >>>>>>>>>>>>>>>>>> name spaces. However, most metric monitoring
> >>>> systems
> >>>>>>>> using
> >>>>>>>>> "."
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> annotate hierarchy, so to avoid issues around
> >>>> metric
> >>>>>>>> names,
> >>>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>> replaces the "." in the name with an underscore.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This generates good metric names, but creates the
> >>>>>> problem
> >>>>>>>>> with
> >>>>>>>>>>>> name
> >>>>>>>>>>>>>>>>> collisions.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I'm wondering if it makes sense to simply limit the
> >>>>>> range
> >>>>>>>>> of
> >>>>>>>>>>>>>>>>>> characters permitted in a topic name and disallow
> >>>> "_"?
> >>>>>>>>>>> Obviously
> >>>>>>>>>>>>>>>>>> existing topics will need to remain as is, which
> >>>> is a
> >>>>>> bit
> >>>>>>>>>>>> awkward.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Interesting problem! Many if not most users I
> >>>>>> personally am
> >>>>>>>>>>> aware
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> use "_" as a separator in topic names. I am sure that
> >>>>>> many
> >>>>>>>>> users
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>> be quite surprised by this limitation. With that
> >>>> said,
> >>>>>> I am
> >>>>>>>>> sure
> >>>>>>>>>>>>>>>>> they'd transition accordingly.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If anyone has better backward-compatible solutions
> >>>> to
> >>>>>>>> this,
> >>>>>>>>>>> I'm
> >>>>>>>>>>>> all
> >>>>>>>>>>>>>>>> ears
> >>>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Gwen
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Grant Henke
> >>>>>>>>>>>>>>> Solutions Consultant | Cloudera
> >>>>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
> >>>>>>>>>>>> linkedin.com/in/granthenke
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Grant Henke
> >>>>>>>>>>>>> Solutions Consultant | Cloudera
> >>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
> >>>>>>>>> linkedin.com/in/granthenke
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Thanks,
> >>>>>>>> Neha
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks,
> >>>>> Ewen
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Ewen
>



-- 
Grant Henke
Solutions Consultant | Cloudera
ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [Discussion] Limitations on topic names

Posted by Joel Koshy <jj...@gmail.com>.
One way to get around this conflict could be to replace . with _ and _ with __

On Sat, Jul 11, 2015 at 10:33 AM, Todd Palino <tp...@gmail.com> wrote:
> I tend to agree with this as a compromise at this point. The reality is that this is technical debt that has built up in the project, and it does not go away by documenting it, and it will only get worse.
>
> As pointed out, eliminating either character at this point is going to cause problems for someone. And unfortunately, Guozhang, converting to __ doesn't really solve the problem either because that is still a valid topic name that could collide. It's less likely, but all it does is move the debt around a little.
>
> -Todd
>
>> On Jul 11, 2015, at 10:16 AM, Brock Noland <br...@apache.org> wrote:
>>
>> On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
>> <ew...@confluent.io> wrote:
>>> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com> wrote:
>>>
>>>> Yeah, I have an actual customer who ran into this. Unfortunately,
>>>> inconsistencies in the way things are named are pretty common - just
>>>> look at Kafka's many CLI options.
>>>>
>>>> I don't think that supporting both and pointing at the docs with "I
>>>> told you so" when our metrics break is a good solution.
>>>
>>> I agree, especially since we don't *already* have something in the docs
>>> indicating this will be an issue. I was flippant about the situation
>>> because I *wish* there was more careful consideration + naming policy in
>>> place, but I realize that doesn't always happen in practice. I guess I need
>>> to take Compatibility Czar more seriously :)
>>>
>>> I see think the obvious practical options are as follows:
>>>
>>> 1. Kill support for "_". Piss off the entire set of people who currently
>>> use "_" anywhere in topic names.
>>> 2. Kill support for ".". Piss off the entire set of people who currently
>>> use "." anywhere in topic names.
>>> 3. Tell people they need to be careful about this issue. Piss off the set
>>> of people who use both "_" and "." *and* happen to have conflicting topic
>>> names. They will have some pain when they discover the issue and have to
>>> figure out how to move one of those topics over to a non-conflicting name.
>>> I'm going to claim that this group must be an *extremely* small fraction of
>>> users, which doesn't make it better to allow things to break for them, but
>>> at least gives us an idea of the scale of impact.
>>>
>>> (One other alternative suggested earlier was encoding metric names to
>>> account for differences; given the metric renaming mess in the last
>>> release, I'm extremely hesitant to suggest anything of the sort...)
>>>
>>> None of the options are ideal, but to me, 3 seems like the least painful.
>>> Both for us, and for the vast majority of users. It seems to me that the
>>> number of users that would complain about (1) or (2) drastically outweigh
>>> (3).
>>>
>>> At this point, I don't think it's practical to keep switching the rules
>>> about which characters are allowed and which aren't because the previous
>>> attempts haven't been successful -- it seems the rules have changed
>>> multiple times, whether intentionally or accidentally, such that any more
>>> changes will cause problems. At this point, I think we just need to accept
>>> being liberal in accepting the range of topic names that have been
>>> permitted so far and make the best of the situation, even if it means only
>>> being able to warn people of conflicts.
>>>
>>> Here's another alternative: how about being liberal with topic name
>>> characters, but upon topic creation we convert the name to the metric name
>>> and fail if there's a conflict with another topic? This is relatively
>>> expensive (requires getting the metric name of all other topics), but it
>>> avoids the bad situation we're encountering here (conflicting metrics),
>>> avoids getting into a persistent conflict (we kill topic creation when we
>>> detect the issue rather than noticing it when the metrics conflict
>>> happens), and keeps the vast majority of existing users happy (both _ and .
>>> work in topic names as long as you don't create topics with conflicting
>>> metric names).
>>>
>>> There are definitely details to be worked out (auto topic creation?), but
>>> it seems like a more realistic solution than to start disallowing _ or . in
>>> topic names.
>>
>> I was thinking the same. Allow a.b or a_b but not a.b and a_b. This
>> seems like it will impact a trivial amount of users and keep both the
>> "." and "_" camps happy.
>>
>>>
>>> -Ewen
>>>
>>>
>>>>
>>>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
>>>> <ew...@confluent.io> wrote:
>>>>> I figure you'll probably see complaints no matter what change you make.
>>>>> Gwen, given that you raised this, another important question might be how
>>>>> many people you see using *both*. I'm guessing this question came up
>>>>> because you actually saw a conflict? But I'd imagine (or at least hope)
>>>>> that most organizations are mostly consistent about naming topics -- they
>>>>> standardize on one or the other.
>>>>>
>>>>> Since there's no "right" way to name them, I'd just leave it supporting
>>>>> both and document the potential conflict in metrics. And if people use
>>>> both
>>>>> naming schemes, they probably deserve to suffer for their inconsistency
>>>> :)
>>>>>
>>>>> -Ewen
>>>>>
>>>>>> On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> I find dots more common in my customer base, so I will definitely feel
>>>>>> the pain of removing them.
>>>>>>
>>>>>> However, "." are already used in metrics, file names, directories, etc
>>>>>> - so if we keep the dots, we need to keep code that translates them
>>>>>> and document the translation. Just banning "." seems more natural.
>>>>>> Also, as Grant mentioned, we'll probably have our own special usage
>>>>>> for "." down the line.
>>>>>>
>>>>>>> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
>>>>>>> I absolutely disagree with #2, Neha. That will break a lot of
>>>>>>> infrastructure within LinkedIn. That said, removing "." might break
>>>> other
>>>>>>> people as well, but I think we should have a clearer idea of how much
>>>>>> usage
>>>>>>> there is on either side.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>>
>>>>>>>> On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> "." seems natural for grouping topic names. +1 for 2) going forward
>>>> only
>>>>>>>> without breaking previously created topics with "_" though that might
>>>>>>>> require us to patch the code somewhat awkwardly till we phase it out
>>>> a
>>>>>>>> couple (purposely left vague to stay out of Ewen's wrath :-))
>>>> versions
>>>>>>>> later.
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gshapira@cloudera.com
>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I don't think we should break existing topics. Just disallow new
>>>>>>>>> topics going forward.
>>>>>>>>>
>>>>>>>>> Agree that having both is horrible, but we should have a solution
>>>> that
>>>>>>>>> fails when you run "kafka_topics.sh --create", not when you
>>>> configure
>>>>>>>>> Ganglia.
>>>>>>>>>
>>>>>>>>> Gwen
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
>>>> wrote:
>>>>>>>>>> Unfortunately '.' is pretty common too. I agree that it is
>>>> perverse,
>>>>>>>> but
>>>>>>>>>> people seem to do it. Breaking all the topics with '.' in the
>>>> name
>>>>>>>> seems
>>>>>>>>>> like it could be worse than combining metrics for people who
>>>> have a
>>>>>>>>>> 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
>>>>>> perverse,
>>>>>>>>>> no?).
>>>>>>>>>>
>>>>>>>>>> Where is our Dean of Compatibility, Ewen, on this?
>>>>>>>>>>
>>>>>>>>>> -Jay
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> My selfish point of view is that we do #1, as we use "_"
>>>>>> extensively
>>>>>>>> in
>>>>>>>>>>> topic names here :) I also happen to think it's the right
>>>> choice,
>>>>>>>>>>> specifically because "." has more special meanings, as you
>>>> noted.
>>>>>>>>>>>
>>>>>>>>>>> -Todd
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
>>>>>> gshapira@cloudera.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unintentional side effect from allowing IP addresses in
>>>> consumer
>>>>>>>>> client
>>>>>>>>>>>> IDs :)
>>>>>>>>>>>>
>>>>>>>>>>>> So the question is, what do we do now?
>>>>>>>>>>>>
>>>>>>>>>>>> 1) disallow "."
>>>>>>>>>>>> 2) disallow "_"
>>>>>>>>>>>> 3) find a reversible way to encode "." and "_" that won't
>>>> break
>>>>>>>>> existing
>>>>>>>>>>>> metrics
>>>>>>>>>>>> 4) all of the above?
>>>>>>>>>>>>
>>>>>>>>>>>> btw. it looks like "." and ".." are currently valid. Topic
>>>> names
>>>>>> are
>>>>>>>>>>>> used for directories, right? this sounds like fun :)
>>>>>>>>>>>>
>>>>>>>>>>>> I vote for option #1, although if someone has a good idea for
>>>> #3
>>>>>> it
>>>>>>>>>>>> will be even better.
>>>>>>>>>>>>
>>>>>>>>>>>> Gwen
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
>>>>>> ghenke@cloudera.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Found it was added here:
>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-697
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
>>>>>> tpalino@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This was definitely changed at some point after KAFKA-495.
>>>> The
>>>>>>>>>>> question
>>>>>>>>>>>> is
>>>>>>>>>>>>>> when and why.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's the relevant code from that patch:
>>>>>>>> ===================================================================
>>>>>>>>>>>>>> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>>>>>>>> 1390178)
>>>>>>>>>>>>>> +++ core/src/main/scala/kafka/utils/Topic.scala (working
>>>> copy)
>>>>>>>>>>>>>> @@ -21,24 +21,21 @@
>>>>>>>>>>>>>> import util.matching.Regex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> object Topic {
>>>>>>>>>>>>>> +  val legalChars = "[a-zA-Z0-9_-]"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>>>>>>>> ghenke@cloudera.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> kafka.common.Topic shows that currently period is a valid
>>>>>>>>> character
>>>>>>>>>>>> and I
>>>>>>>>>>>>>>> have verified I can use kafka-topics.sh to create a new
>>>>>> topic
>>>>>>>>> with a
>>>>>>>>>>>>>>> period.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>>>>>>>>> currently
>>>>>>>>>>>> uses
>>>>>>>>>>>>>>> Topic.validate before writing to Zookeeper.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Should period character support be removed? I was under
>>>> the
>>>>>>>> same
>>>>>>>>>>>>>> impression
>>>>>>>>>>>>>>> as Gwen, that a period was used by many as a way to
>>>> "group"
>>>>>>>>> topics.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The code is pasted below since its small:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> object Topic {
>>>>>>>>>>>>>>>  val legalChars = "[a-zA-Z0-9\\._\\-]"
>>>>>>>>>>>>>>>  private val maxNameLength = 255
>>>>>>>>>>>>>>>  private val rgx = new Regex(legalChars + "+")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  val InternalTopics =
>>>> Set(OffsetManager.OffsetsTopicName)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  def validate(topic: String) {
>>>>>>>>>>>>>>>    if (topic.length <= 0)
>>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
>>>>>> illegal,
>>>>>>>>> can't
>>>>>>>>>>> be
>>>>>>>>>>>>>>> empty")
>>>>>>>>>>>>>>>    else if (topic.equals(".") || topic.equals(".."))
>>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name cannot
>>>> be
>>>>>>>>> \".\" or
>>>>>>>>>>>>>>> \"..\"")
>>>>>>>>>>>>>>>    else if (topic.length > maxNameLength)
>>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
>>>>>> illegal,
>>>>>>>>> can't
>>>>>>>>>>> be
>>>>>>>>>>>>>>> longer than " + maxNameLength + " characters")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    rgx.findFirstIn(topic) match {
>>>>>>>>>>>>>>>      case Some(t) =>
>>>>>>>>>>>>>>>        if (!t.equals(topic))
>>>>>>>>>>>>>>>          throw new InvalidTopicException("topic name " +
>>>>>> topic
>>>>>>>>> + "
>>>>>>>>>>> is
>>>>>>>>>>>>>>> illegal, contains a character other than ASCII
>>>>>> alphanumerics,
>>>>>>>>> '.',
>>>>>>>>>>> '_'
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> '-'")
>>>>>>>>>>>>>>>      case None => throw new InvalidTopicException("topic
>>>>>> name
>>>>>>>> "
>>>>>>>>> +
>>>>>>>>>>>> topic
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> " is illegal,  contains a character other than ASCII
>>>>>>>>> alphanumerics,
>>>>>>>>>>>> '.',
>>>>>>>>>>>>>>> '_' and '-'")
>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>>>>>>>> tpalino@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had to go look this one up again to make sure -
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-495
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The only valid character names for topics are
>>>>>> alphanumeric,
>>>>>>>>>>>> underscore,
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> dash. A period is not supposed to be a valid character
>>>> to
>>>>>>>> use.
>>>>>>>>> If
>>>>>>>>>>>>>> you're
>>>>>>>>>>>>>>>> seeing them, then one of two things have happened:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) You have topic names that are grandfathered in from
>>>>>> before
>>>>>>>>> that
>>>>>>>>>>>>>> patch
>>>>>>>>>>>>>>>> 2) The patch is not working properly and there is
>>>>>> somewhere
>>>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>>> that the standard is not being enforced.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>>>>>>>>> brock@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>>>>>>>>>>>>>> gshapira@cloudera.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Hi Kafka Fans,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you have one topic named "kafka_lab_2" and the
>>>>>> other
>>>>>>>>> named
>>>>>>>>>>>>>>>>>> "kafka.lab.2", the topic level metrics will be
>>>> named
>>>>>>>>>>> kafka_lab_2
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> both, effectively making it impossible to monitor
>>>> them
>>>>>>>>>>> properly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The reason this happens is that using "." in topic
>>>>>> names
>>>>>>>> is
>>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>>>> common, especially as a way to group topics into
>>>> data
>>>>>>>>> centers,
>>>>>>>>>>>>>>>>>> relevant apps, etc - basically a work-around to our
>>>>>>>> current
>>>>>>>>>>>> lack of
>>>>>>>>>>>>>>>>>> name spaces. However, most metric monitoring
>>>> systems
>>>>>>>> using
>>>>>>>>> "."
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> annotate hierarchy, so to avoid issues around
>>>> metric
>>>>>>>> names,
>>>>>>>>>>>> Kafka
>>>>>>>>>>>>>>>>>> replaces the "." in the name with an underscore.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This generates good metric names, but creates the
>>>>>> problem
>>>>>>>>> with
>>>>>>>>>>>> name
>>>>>>>>>>>>>>>>> collisions.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm wondering if it makes sense to simply limit the
>>>>>> range
>>>>>>>>> of
>>>>>>>>>>>>>>>>>> characters permitted in a topic name and disallow
>>>> "_"?
>>>>>>>>>>> Obviously
>>>>>>>>>>>>>>>>>> existing topics will need to remain as is, which
>>>> is a
>>>>>> bit
>>>>>>>>>>>> awkward.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Interesting problem! Many if not most users I
>>>>>> personally am
>>>>>>>>>>> aware
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> use "_" as a separator in topic names. I am sure that
>>>>>> many
>>>>>>>>> users
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> be quite surprised by this limitation. With that
>>>> said,
>>>>>> I am
>>>>>>>>> sure
>>>>>>>>>>>>>>>>> they'd transition accordingly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If anyone has better backward-compatible solutions
>>>> to
>>>>>>>> this,
>>>>>>>>>>> I'm
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>> ears
>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Gwen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Grant Henke
>>>>>>>>>>>>>>> Solutions Consultant | Cloudera
>>>>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
>>>>>>>>>>>> linkedin.com/in/granthenke
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Grant Henke
>>>>>>>>>>>>> Solutions Consultant | Cloudera
>>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
>>>>>>>>> linkedin.com/in/granthenke
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> Neha
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Ewen
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Ewen

Re: [Discussion] Limitations on topic names

Posted by Joe Stein <jo...@stealth.ly>.
Can we provide a tool so folks can "sync back" old topic names to new so
their clusters aren't format lopsided.

~ Joestein
On Jul 11, 2015 1:33 PM, "Todd Palino" <tp...@gmail.com> wrote:

> I tend to agree with this as a compromise at this point. The reality is
> that this is technical debt that has built up in the project, and it does
> not go away by documenting it, and it will only get worse.
>
> As pointed out, eliminating either character at this point is going to
> cause problems for someone. And unfortunately, Guozhang, converting to __
> doesn't really solve the problem either because that is still a valid topic
> name that could collide. It's less likely, but all it does is move the debt
> around a little.
>
> -Todd
>
> > On Jul 11, 2015, at 10:16 AM, Brock Noland <br...@apache.org> wrote:
> >
> > On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> > <ew...@confluent.io> wrote:
> >> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
> >>
> >>> Yeah, I have an actual customer who ran into this. Unfortunately,
> >>> inconsistencies in the way things are named are pretty common - just
> >>> look at Kafka's many CLI options.
> >>>
> >>> I don't think that supporting both and pointing at the docs with "I
> >>> told you so" when our metrics break is a good solution.
> >>
> >> I agree, especially since we don't *already* have something in the docs
> >> indicating this will be an issue. I was flippant about the situation
> >> because I *wish* there was more careful consideration + naming policy in
> >> place, but I realize that doesn't always happen in practice. I guess I
> need
> >> to take Compatibility Czar more seriously :)
> >>
> >> I see think the obvious practical options are as follows:
> >>
> >> 1. Kill support for "_". Piss off the entire set of people who currently
> >> use "_" anywhere in topic names.
> >> 2. Kill support for ".". Piss off the entire set of people who currently
> >> use "." anywhere in topic names.
> >> 3. Tell people they need to be careful about this issue. Piss off the
> set
> >> of people who use both "_" and "." *and* happen to have conflicting
> topic
> >> names. They will have some pain when they discover the issue and have to
> >> figure out how to move one of those topics over to a non-conflicting
> name.
> >> I'm going to claim that this group must be an *extremely* small
> fraction of
> >> users, which doesn't make it better to allow things to break for them,
> but
> >> at least gives us an idea of the scale of impact.
> >>
> >> (One other alternative suggested earlier was encoding metric names to
> >> account for differences; given the metric renaming mess in the last
> >> release, I'm extremely hesitant to suggest anything of the sort...)
> >>
> >> None of the options are ideal, but to me, 3 seems like the least
> painful.
> >> Both for us, and for the vast majority of users. It seems to me that the
> >> number of users that would complain about (1) or (2) drastically
> outweigh
> >> (3).
> >>
> >> At this point, I don't think it's practical to keep switching the rules
> >> about which characters are allowed and which aren't because the previous
> >> attempts haven't been successful -- it seems the rules have changed
> >> multiple times, whether intentionally or accidentally, such that any
> more
> >> changes will cause problems. At this point, I think we just need to
> accept
> >> being liberal in accepting the range of topic names that have been
> >> permitted so far and make the best of the situation, even if it means
> only
> >> being able to warn people of conflicts.
> >>
> >> Here's another alternative: how about being liberal with topic name
> >> characters, but upon topic creation we convert the name to the metric
> name
> >> and fail if there's a conflict with another topic? This is relatively
> >> expensive (requires getting the metric name of all other topics), but it
> >> avoids the bad situation we're encountering here (conflicting metrics),
> >> avoids getting into a persistent conflict (we kill topic creation when
> we
> >> detect the issue rather than noticing it when the metrics conflict
> >> happens), and keeps the vast majority of existing users happy (both _
> and .
> >> work in topic names as long as you don't create topics with conflicting
> >> metric names).
> >>
> >> There are definitely details to be worked out (auto topic creation?),
> but
> >> it seems like a more realistic solution than to start disallowing _ or
> . in
> >> topic names.
> >
> > I was thinking the same. Allow a.b or a_b but not a.b and a_b. This
> > seems like it will impact a trivial amount of users and keep both the
> > "." and "_" camps happy.
> >
> >>
> >> -Ewen
> >>
> >>
> >>>
> >>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> >>> <ew...@confluent.io> wrote:
> >>>> I figure you'll probably see complaints no matter what change you
> make.
> >>>> Gwen, given that you raised this, another important question might be
> how
> >>>> many people you see using *both*. I'm guessing this question came up
> >>>> because you actually saw a conflict? But I'd imagine (or at least
> hope)
> >>>> that most organizations are mostly consistent about naming topics --
> they
> >>>> standardize on one or the other.
> >>>>
> >>>> Since there's no "right" way to name them, I'd just leave it
> supporting
> >>>> both and document the potential conflict in metrics. And if people use
> >>> both
> >>>> naming schemes, they probably deserve to suffer for their
> inconsistency
> >>> :)
> >>>>
> >>>> -Ewen
> >>>>
> >>>>> On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gshapira@cloudera.com
> >
> >>>> wrote:
> >>>>
> >>>>> I find dots more common in my customer base, so I will definitely
> feel
> >>>>> the pain of removing them.
> >>>>>
> >>>>> However, "." are already used in metrics, file names, directories,
> etc
> >>>>> - so if we keep the dots, we need to keep code that translates them
> >>>>> and document the translation. Just banning "." seems more natural.
> >>>>> Also, as Grant mentioned, we'll probably have our own special usage
> >>>>> for "." down the line.
> >>>>>
> >>>>>> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> wrote:
> >>>>>> I absolutely disagree with #2, Neha. That will break a lot of
> >>>>>> infrastructure within LinkedIn. That said, removing "." might break
> >>> other
> >>>>>> people as well, but I think we should have a clearer idea of how
> much
> >>>>> usage
> >>>>>> there is on either side.
> >>>>>>
> >>>>>> -Todd
> >>>>>>
> >>>>>>
> >>>>>>> On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> "." seems natural for grouping topic names. +1 for 2) going forward
> >>> only
> >>>>>>> without breaking previously created topics with "_" though that
> might
> >>>>>>> require us to patch the code somewhat awkwardly till we phase it
> out
> >>> a
> >>>>>>> couple (purposely left vague to stay out of Ewen's wrath :-))
> >>> versions
> >>>>>>> later.
> >>>>>>>
> >>>>>>> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> gshapira@cloudera.com
> >>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I don't think we should break existing topics. Just disallow new
> >>>>>>>> topics going forward.
> >>>>>>>>
> >>>>>>>> Agree that having both is horrible, but we should have a solution
> >>> that
> >>>>>>>> fails when you run "kafka_topics.sh --create", not when you
> >>> configure
> >>>>>>>> Ganglia.
> >>>>>>>>
> >>>>>>>> Gwen
> >>>>>>>>
> >>>>>>>> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> >>> wrote:
> >>>>>>>>> Unfortunately '.' is pretty common too. I agree that it is
> >>> perverse,
> >>>>>>> but
> >>>>>>>>> people seem to do it. Breaking all the topics with '.' in the
> >>> name
> >>>>>>> seems
> >>>>>>>>> like it could be worse than combining metrics for people who
> >>> have a
> >>>>>>>>> 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> >>>>> perverse,
> >>>>>>>>> no?).
> >>>>>>>>>
> >>>>>>>>> Where is our Dean of Compatibility, Ewen, on this?
> >>>>>>>>>
> >>>>>>>>> -Jay
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> My selfish point of view is that we do #1, as we use "_"
> >>>>> extensively
> >>>>>>> in
> >>>>>>>>>> topic names here :) I also happen to think it's the right
> >>> choice,
> >>>>>>>>>> specifically because "." has more special meanings, as you
> >>> noted.
> >>>>>>>>>>
> >>>>>>>>>> -Todd
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> >>>>> gshapira@cloudera.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Unintentional side effect from allowing IP addresses in
> >>> consumer
> >>>>>>>> client
> >>>>>>>>>>> IDs :)
> >>>>>>>>>>>
> >>>>>>>>>>> So the question is, what do we do now?
> >>>>>>>>>>>
> >>>>>>>>>>> 1) disallow "."
> >>>>>>>>>>> 2) disallow "_"
> >>>>>>>>>>> 3) find a reversible way to encode "." and "_" that won't
> >>> break
> >>>>>>>> existing
> >>>>>>>>>>> metrics
> >>>>>>>>>>> 4) all of the above?
> >>>>>>>>>>>
> >>>>>>>>>>> btw. it looks like "." and ".." are currently valid. Topic
> >>> names
> >>>>> are
> >>>>>>>>>>> used for directories, right? this sounds like fun :)
> >>>>>>>>>>>
> >>>>>>>>>>> I vote for option #1, although if someone has a good idea for
> >>> #3
> >>>>> it
> >>>>>>>>>>> will be even better.
> >>>>>>>>>>>
> >>>>>>>>>>> Gwen
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> >>>>> ghenke@cloudera.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>> Found it was added here:
> >>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-697
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> >>>>> tpalino@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> This was definitely changed at some point after KAFKA-495.
> >>> The
> >>>>>>>>>> question
> >>>>>>>>>>> is
> >>>>>>>>>>>>> when and why.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here's the relevant code from that patch:
> >>>>>>> ===================================================================
> >>>>>>>>>>>>> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> >>>>>>> 1390178)
> >>>>>>>>>>>>> +++ core/src/main/scala/kafka/utils/Topic.scala (working
> >>> copy)
> >>>>>>>>>>>>> @@ -21,24 +21,21 @@
> >>>>>>>>>>>>> import util.matching.Regex
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> object Topic {
> >>>>>>>>>>>>> +  val legalChars = "[a-zA-Z0-9_-]"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> >>>>>>> ghenke@cloudera.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> kafka.common.Topic shows that currently period is a valid
> >>>>>>>> character
> >>>>>>>>>>> and I
> >>>>>>>>>>>>>> have verified I can use kafka-topics.sh to create a new
> >>>>> topic
> >>>>>>>> with a
> >>>>>>>>>>>>>> period.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> >>>>>>>> currently
> >>>>>>>>>>> uses
> >>>>>>>>>>>>>> Topic.validate before writing to Zookeeper.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Should period character support be removed? I was under
> >>> the
> >>>>>>> same
> >>>>>>>>>>>>> impression
> >>>>>>>>>>>>>> as Gwen, that a period was used by many as a way to
> >>> "group"
> >>>>>>>> topics.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The code is pasted below since its small:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> object Topic {
> >>>>>>>>>>>>>>  val legalChars = "[a-zA-Z0-9\\._\\-]"
> >>>>>>>>>>>>>>  private val maxNameLength = 255
> >>>>>>>>>>>>>>  private val rgx = new Regex(legalChars + "+")
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  val InternalTopics =
> >>> Set(OffsetManager.OffsetsTopicName)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  def validate(topic: String) {
> >>>>>>>>>>>>>>    if (topic.length <= 0)
> >>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
> >>>>> illegal,
> >>>>>>>> can't
> >>>>>>>>>> be
> >>>>>>>>>>>>>> empty")
> >>>>>>>>>>>>>>    else if (topic.equals(".") || topic.equals(".."))
> >>>>>>>>>>>>>>      throw new InvalidTopicException("topic name cannot
> >>> be
> >>>>>>>> \".\" or
> >>>>>>>>>>>>>> \"..\"")
> >>>>>>>>>>>>>>    else if (topic.length > maxNameLength)
> >>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
> >>>>> illegal,
> >>>>>>>> can't
> >>>>>>>>>> be
> >>>>>>>>>>>>>> longer than " + maxNameLength + " characters")
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    rgx.findFirstIn(topic) match {
> >>>>>>>>>>>>>>      case Some(t) =>
> >>>>>>>>>>>>>>        if (!t.equals(topic))
> >>>>>>>>>>>>>>          throw new InvalidTopicException("topic name " +
> >>>>> topic
> >>>>>>>> + "
> >>>>>>>>>> is
> >>>>>>>>>>>>>> illegal, contains a character other than ASCII
> >>>>> alphanumerics,
> >>>>>>>> '.',
> >>>>>>>>>> '_'
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>> '-'")
> >>>>>>>>>>>>>>      case None => throw new InvalidTopicException("topic
> >>>>> name
> >>>>>>> "
> >>>>>>>> +
> >>>>>>>>>>> topic
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>>> " is illegal,  contains a character other than ASCII
> >>>>>>>> alphanumerics,
> >>>>>>>>>>> '.',
> >>>>>>>>>>>>>> '_' and '-'")
> >>>>>>>>>>>>>>    }
> >>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> >>>>>>> tpalino@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I had to go look this one up again to make sure -
> >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-495
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The only valid character names for topics are
> >>>>> alphanumeric,
> >>>>>>>>>>> underscore,
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> dash. A period is not supposed to be a valid character
> >>> to
> >>>>>>> use.
> >>>>>>>> If
> >>>>>>>>>>>>> you're
> >>>>>>>>>>>>>>> seeing them, then one of two things have happened:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) You have topic names that are grandfathered in from
> >>>>> before
> >>>>>>>> that
> >>>>>>>>>>>>> patch
> >>>>>>>>>>>>>>> 2) The patch is not working properly and there is
> >>>>> somewhere
> >>>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>>> broker
> >>>>>>>>>>>>>>> that the standard is not being enforced.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> >>>>>>>> brock@apache.org>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >>>>>>>>>>>>> gshapira@cloudera.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> Hi Kafka Fans,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If you have one topic named "kafka_lab_2" and the
> >>>>> other
> >>>>>>>> named
> >>>>>>>>>>>>>>>>> "kafka.lab.2", the topic level metrics will be
> >>> named
> >>>>>>>>>> kafka_lab_2
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> both, effectively making it impossible to monitor
> >>> them
> >>>>>>>>>> properly.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The reason this happens is that using "." in topic
> >>>>> names
> >>>>>>> is
> >>>>>>>>>>> pretty
> >>>>>>>>>>>>>>>>> common, especially as a way to group topics into
> >>> data
> >>>>>>>> centers,
> >>>>>>>>>>>>>>>>> relevant apps, etc - basically a work-around to our
> >>>>>>> current
> >>>>>>>>>>> lack of
> >>>>>>>>>>>>>>>>> name spaces. However, most metric monitoring
> >>> systems
> >>>>>>> using
> >>>>>>>> "."
> >>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> annotate hierarchy, so to avoid issues around
> >>> metric
> >>>>>>> names,
> >>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>> replaces the "." in the name with an underscore.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This generates good metric names, but creates the
> >>>>> problem
> >>>>>>>> with
> >>>>>>>>>>> name
> >>>>>>>>>>>>>>>> collisions.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm wondering if it makes sense to simply limit the
> >>>>> range
> >>>>>>>> of
> >>>>>>>>>>>>>>>>> characters permitted in a topic name and disallow
> >>> "_"?
> >>>>>>>>>> Obviously
> >>>>>>>>>>>>>>>>> existing topics will need to remain as is, which
> >>> is a
> >>>>> bit
> >>>>>>>>>>> awkward.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Interesting problem! Many if not most users I
> >>>>> personally am
> >>>>>>>>>> aware
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>> use "_" as a separator in topic names. I am sure that
> >>>>> many
> >>>>>>>> users
> >>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>> be quite surprised by this limitation. With that
> >>> said,
> >>>>> I am
> >>>>>>>> sure
> >>>>>>>>>>>>>>>> they'd transition accordingly.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If anyone has better backward-compatible solutions
> >>> to
> >>>>>>> this,
> >>>>>>>>>> I'm
> >>>>>>>>>>> all
> >>>>>>>>>>>>>>> ears
> >>>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Gwen
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Grant Henke
> >>>>>>>>>>>>>> Solutions Consultant | Cloudera
> >>>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
> >>>>>>>>>>> linkedin.com/in/granthenke
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Grant Henke
> >>>>>>>>>>>> Solutions Consultant | Cloudera
> >>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
> >>>>>>>> linkedin.com/in/granthenke
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Thanks,
> >>>>>>> Neha
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Ewen
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Ewen
>

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
I tend to agree with this as a compromise at this point. The reality is that this is technical debt that has built up in the project, and it does not go away by documenting it, and it will only get worse.

As pointed out, eliminating either character at this point is going to cause problems for someone. And unfortunately, Guozhang, converting to __ doesn't really solve the problem either because that is still a valid topic name that could collide. It's less likely, but all it does is move the debt around a little.

-Todd

> On Jul 11, 2015, at 10:16 AM, Brock Noland <br...@apache.org> wrote:
> 
> On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> <ew...@confluent.io> wrote:
>> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com> wrote:
>> 
>>> Yeah, I have an actual customer who ran into this. Unfortunately,
>>> inconsistencies in the way things are named are pretty common - just
>>> look at Kafka's many CLI options.
>>> 
>>> I don't think that supporting both and pointing at the docs with "I
>>> told you so" when our metrics break is a good solution.
>> 
>> I agree, especially since we don't *already* have something in the docs
>> indicating this will be an issue. I was flippant about the situation
>> because I *wish* there was more careful consideration + naming policy in
>> place, but I realize that doesn't always happen in practice. I guess I need
>> to take Compatibility Czar more seriously :)
>> 
>> I see think the obvious practical options are as follows:
>> 
>> 1. Kill support for "_". Piss off the entire set of people who currently
>> use "_" anywhere in topic names.
>> 2. Kill support for ".". Piss off the entire set of people who currently
>> use "." anywhere in topic names.
>> 3. Tell people they need to be careful about this issue. Piss off the set
>> of people who use both "_" and "." *and* happen to have conflicting topic
>> names. They will have some pain when they discover the issue and have to
>> figure out how to move one of those topics over to a non-conflicting name.
>> I'm going to claim that this group must be an *extremely* small fraction of
>> users, which doesn't make it better to allow things to break for them, but
>> at least gives us an idea of the scale of impact.
>> 
>> (One other alternative suggested earlier was encoding metric names to
>> account for differences; given the metric renaming mess in the last
>> release, I'm extremely hesitant to suggest anything of the sort...)
>> 
>> None of the options are ideal, but to me, 3 seems like the least painful.
>> Both for us, and for the vast majority of users. It seems to me that the
>> number of users that would complain about (1) or (2) drastically outweigh
>> (3).
>> 
>> At this point, I don't think it's practical to keep switching the rules
>> about which characters are allowed and which aren't because the previous
>> attempts haven't been successful -- it seems the rules have changed
>> multiple times, whether intentionally or accidentally, such that any more
>> changes will cause problems. At this point, I think we just need to accept
>> being liberal in accepting the range of topic names that have been
>> permitted so far and make the best of the situation, even if it means only
>> being able to warn people of conflicts.
>> 
>> Here's another alternative: how about being liberal with topic name
>> characters, but upon topic creation we convert the name to the metric name
>> and fail if there's a conflict with another topic? This is relatively
>> expensive (requires getting the metric name of all other topics), but it
>> avoids the bad situation we're encountering here (conflicting metrics),
>> avoids getting into a persistent conflict (we kill topic creation when we
>> detect the issue rather than noticing it when the metrics conflict
>> happens), and keeps the vast majority of existing users happy (both _ and .
>> work in topic names as long as you don't create topics with conflicting
>> metric names).
>> 
>> There are definitely details to be worked out (auto topic creation?), but
>> it seems like a more realistic solution than to start disallowing _ or . in
>> topic names.
> 
> I was thinking the same. Allow a.b or a_b but not a.b and a_b. This
> seems like it will impact a trivial amount of users and keep both the
> "." and "_" camps happy.
> 
>> 
>> -Ewen
>> 
>> 
>>> 
>>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
>>> <ew...@confluent.io> wrote:
>>>> I figure you'll probably see complaints no matter what change you make.
>>>> Gwen, given that you raised this, another important question might be how
>>>> many people you see using *both*. I'm guessing this question came up
>>>> because you actually saw a conflict? But I'd imagine (or at least hope)
>>>> that most organizations are mostly consistent about naming topics -- they
>>>> standardize on one or the other.
>>>> 
>>>> Since there's no "right" way to name them, I'd just leave it supporting
>>>> both and document the potential conflict in metrics. And if people use
>>> both
>>>> naming schemes, they probably deserve to suffer for their inconsistency
>>> :)
>>>> 
>>>> -Ewen
>>>> 
>>>>> On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
>>>> wrote:
>>>> 
>>>>> I find dots more common in my customer base, so I will definitely feel
>>>>> the pain of removing them.
>>>>> 
>>>>> However, "." are already used in metrics, file names, directories, etc
>>>>> - so if we keep the dots, we need to keep code that translates them
>>>>> and document the translation. Just banning "." seems more natural.
>>>>> Also, as Grant mentioned, we'll probably have our own special usage
>>>>> for "." down the line.
>>>>> 
>>>>>> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
>>>>>> I absolutely disagree with #2, Neha. That will break a lot of
>>>>>> infrastructure within LinkedIn. That said, removing "." might break
>>> other
>>>>>> people as well, but I think we should have a clearer idea of how much
>>>>> usage
>>>>>> there is on either side.
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>> 
>>>>>>> On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
>>>>>> wrote:
>>>>>> 
>>>>>>> "." seems natural for grouping topic names. +1 for 2) going forward
>>> only
>>>>>>> without breaking previously created topics with "_" though that might
>>>>>>> require us to patch the code somewhat awkwardly till we phase it out
>>> a
>>>>>>> couple (purposely left vague to stay out of Ewen's wrath :-))
>>> versions
>>>>>>> later.
>>>>>>> 
>>>>>>> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gshapira@cloudera.com
>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I don't think we should break existing topics. Just disallow new
>>>>>>>> topics going forward.
>>>>>>>> 
>>>>>>>> Agree that having both is horrible, but we should have a solution
>>> that
>>>>>>>> fails when you run "kafka_topics.sh --create", not when you
>>> configure
>>>>>>>> Ganglia.
>>>>>>>> 
>>>>>>>> Gwen
>>>>>>>> 
>>>>>>>> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
>>> wrote:
>>>>>>>>> Unfortunately '.' is pretty common too. I agree that it is
>>> perverse,
>>>>>>> but
>>>>>>>>> people seem to do it. Breaking all the topics with '.' in the
>>> name
>>>>>>> seems
>>>>>>>>> like it could be worse than combining metrics for people who
>>> have a
>>>>>>>>> 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
>>>>> perverse,
>>>>>>>>> no?).
>>>>>>>>> 
>>>>>>>>> Where is our Dean of Compatibility, Ewen, on this?
>>>>>>>>> 
>>>>>>>>> -Jay
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> My selfish point of view is that we do #1, as we use "_"
>>>>> extensively
>>>>>>> in
>>>>>>>>>> topic names here :) I also happen to think it's the right
>>> choice,
>>>>>>>>>> specifically because "." has more special meanings, as you
>>> noted.
>>>>>>>>>> 
>>>>>>>>>> -Todd
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
>>>>> gshapira@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Unintentional side effect from allowing IP addresses in
>>> consumer
>>>>>>>> client
>>>>>>>>>>> IDs :)
>>>>>>>>>>> 
>>>>>>>>>>> So the question is, what do we do now?
>>>>>>>>>>> 
>>>>>>>>>>> 1) disallow "."
>>>>>>>>>>> 2) disallow "_"
>>>>>>>>>>> 3) find a reversible way to encode "." and "_" that won't
>>> break
>>>>>>>> existing
>>>>>>>>>>> metrics
>>>>>>>>>>> 4) all of the above?
>>>>>>>>>>> 
>>>>>>>>>>> btw. it looks like "." and ".." are currently valid. Topic
>>> names
>>>>> are
>>>>>>>>>>> used for directories, right? this sounds like fun :)
>>>>>>>>>>> 
>>>>>>>>>>> I vote for option #1, although if someone has a good idea for
>>> #3
>>>>> it
>>>>>>>>>>> will be even better.
>>>>>>>>>>> 
>>>>>>>>>>> Gwen
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
>>>>> ghenke@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> Found it was added here:
>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-697
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
>>>>> tpalino@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> This was definitely changed at some point after KAFKA-495.
>>> The
>>>>>>>>>> question
>>>>>>>>>>> is
>>>>>>>>>>>>> when and why.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here's the relevant code from that patch:
>>>>>>> ===================================================================
>>>>>>>>>>>>> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>>>>>>> 1390178)
>>>>>>>>>>>>> +++ core/src/main/scala/kafka/utils/Topic.scala (working
>>> copy)
>>>>>>>>>>>>> @@ -21,24 +21,21 @@
>>>>>>>>>>>>> import util.matching.Regex
>>>>>>>>>>>>> 
>>>>>>>>>>>>> object Topic {
>>>>>>>>>>>>> +  val legalChars = "[a-zA-Z0-9_-]"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>>>>>>> ghenke@cloudera.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> kafka.common.Topic shows that currently period is a valid
>>>>>>>> character
>>>>>>>>>>> and I
>>>>>>>>>>>>>> have verified I can use kafka-topics.sh to create a new
>>>>> topic
>>>>>>>> with a
>>>>>>>>>>>>>> period.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>>>>>>>> currently
>>>>>>>>>>> uses
>>>>>>>>>>>>>> Topic.validate before writing to Zookeeper.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Should period character support be removed? I was under
>>> the
>>>>>>> same
>>>>>>>>>>>>> impression
>>>>>>>>>>>>>> as Gwen, that a period was used by many as a way to
>>> "group"
>>>>>>>> topics.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The code is pasted below since its small:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> object Topic {
>>>>>>>>>>>>>>  val legalChars = "[a-zA-Z0-9\\._\\-]"
>>>>>>>>>>>>>>  private val maxNameLength = 255
>>>>>>>>>>>>>>  private val rgx = new Regex(legalChars + "+")
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  val InternalTopics =
>>> Set(OffsetManager.OffsetsTopicName)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  def validate(topic: String) {
>>>>>>>>>>>>>>    if (topic.length <= 0)
>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
>>>>> illegal,
>>>>>>>> can't
>>>>>>>>>> be
>>>>>>>>>>>>>> empty")
>>>>>>>>>>>>>>    else if (topic.equals(".") || topic.equals(".."))
>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name cannot
>>> be
>>>>>>>> \".\" or
>>>>>>>>>>>>>> \"..\"")
>>>>>>>>>>>>>>    else if (topic.length > maxNameLength)
>>>>>>>>>>>>>>      throw new InvalidTopicException("topic name is
>>>>> illegal,
>>>>>>>> can't
>>>>>>>>>> be
>>>>>>>>>>>>>> longer than " + maxNameLength + " characters")
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>    rgx.findFirstIn(topic) match {
>>>>>>>>>>>>>>      case Some(t) =>
>>>>>>>>>>>>>>        if (!t.equals(topic))
>>>>>>>>>>>>>>          throw new InvalidTopicException("topic name " +
>>>>> topic
>>>>>>>> + "
>>>>>>>>>> is
>>>>>>>>>>>>>> illegal, contains a character other than ASCII
>>>>> alphanumerics,
>>>>>>>> '.',
>>>>>>>>>> '_'
>>>>>>>>>>>>> and
>>>>>>>>>>>>>> '-'")
>>>>>>>>>>>>>>      case None => throw new InvalidTopicException("topic
>>>>> name
>>>>>>> "
>>>>>>>> +
>>>>>>>>>>> topic
>>>>>>>>>>>>> +
>>>>>>>>>>>>>> " is illegal,  contains a character other than ASCII
>>>>>>>> alphanumerics,
>>>>>>>>>>> '.',
>>>>>>>>>>>>>> '_' and '-'")
>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>>>>>>> tpalino@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I had to go look this one up again to make sure -
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-495
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The only valid character names for topics are
>>>>> alphanumeric,
>>>>>>>>>>> underscore,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> dash. A period is not supposed to be a valid character
>>> to
>>>>>>> use.
>>>>>>>> If
>>>>>>>>>>>>> you're
>>>>>>>>>>>>>>> seeing them, then one of two things have happened:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1) You have topic names that are grandfathered in from
>>>>> before
>>>>>>>> that
>>>>>>>>>>>>> patch
>>>>>>>>>>>>>>> 2) The patch is not working properly and there is
>>>>> somewhere
>>>>>>> in
>>>>>>>> the
>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>> that the standard is not being enforced.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>>>>>>>> brock@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>>>>>>>>>>>>> gshapira@cloudera.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> Hi Kafka Fans,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If you have one topic named "kafka_lab_2" and the
>>>>> other
>>>>>>>> named
>>>>>>>>>>>>>>>>> "kafka.lab.2", the topic level metrics will be
>>> named
>>>>>>>>>> kafka_lab_2
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> both, effectively making it impossible to monitor
>>> them
>>>>>>>>>> properly.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The reason this happens is that using "." in topic
>>>>> names
>>>>>>> is
>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>>> common, especially as a way to group topics into
>>> data
>>>>>>>> centers,
>>>>>>>>>>>>>>>>> relevant apps, etc - basically a work-around to our
>>>>>>> current
>>>>>>>>>>> lack of
>>>>>>>>>>>>>>>>> name spaces. However, most metric monitoring
>>> systems
>>>>>>> using
>>>>>>>> "."
>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> annotate hierarchy, so to avoid issues around
>>> metric
>>>>>>> names,
>>>>>>>>>>> Kafka
>>>>>>>>>>>>>>>>> replaces the "." in the name with an underscore.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This generates good metric names, but creates the
>>>>> problem
>>>>>>>> with
>>>>>>>>>>> name
>>>>>>>>>>>>>>>> collisions.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'm wondering if it makes sense to simply limit the
>>>>> range
>>>>>>>> of
>>>>>>>>>>>>>>>>> characters permitted in a topic name and disallow
>>> "_"?
>>>>>>>>>> Obviously
>>>>>>>>>>>>>>>>> existing topics will need to remain as is, which
>>> is a
>>>>> bit
>>>>>>>>>>> awkward.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Interesting problem! Many if not most users I
>>>>> personally am
>>>>>>>>>> aware
>>>>>>>>>>> of
>>>>>>>>>>>>>>>> use "_" as a separator in topic names. I am sure that
>>>>> many
>>>>>>>> users
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> be quite surprised by this limitation. With that
>>> said,
>>>>> I am
>>>>>>>> sure
>>>>>>>>>>>>>>>> they'd transition accordingly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If anyone has better backward-compatible solutions
>>> to
>>>>>>> this,
>>>>>>>>>> I'm
>>>>>>>>>>> all
>>>>>>>>>>>>>>> ears
>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Gwen
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Grant Henke
>>>>>>>>>>>>>> Solutions Consultant | Cloudera
>>>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
>>>>>>>>>>> linkedin.com/in/granthenke
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Grant Henke
>>>>>>>>>>>> Solutions Consultant | Cloudera
>>>>>>>>>>>> ghenke@cloudera.com | twitter.com/gchenke |
>>>>>>>> linkedin.com/in/granthenke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Neha
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks,
>>>> Ewen
>> 
>> 
>> 
>> --
>> Thanks,
>> Ewen

Re: [Discussion] Limitations on topic names

Posted by Brock Noland <br...@apache.org>.
On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
<ew...@confluent.io> wrote:
> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com> wrote:
>
>> Yeah, I have an actual customer who ran into this. Unfortunately,
>> inconsistencies in the way things are named are pretty common - just
>> look at Kafka's many CLI options.
>>
>> I don't think that supporting both and pointing at the docs with "I
>> told you so" when our metrics break is a good solution.
>>
>
> I agree, especially since we don't *already* have something in the docs
> indicating this will be an issue. I was flippant about the situation
> because I *wish* there was more careful consideration + naming policy in
> place, but I realize that doesn't always happen in practice. I guess I need
> to take Compatibility Czar more seriously :)
>
> I see think the obvious practical options are as follows:
>
> 1. Kill support for "_". Piss off the entire set of people who currently
> use "_" anywhere in topic names.
> 2. Kill support for ".". Piss off the entire set of people who currently
> use "." anywhere in topic names.
> 3. Tell people they need to be careful about this issue. Piss off the set
> of people who use both "_" and "." *and* happen to have conflicting topic
> names. They will have some pain when they discover the issue and have to
> figure out how to move one of those topics over to a non-conflicting name.
> I'm going to claim that this group must be an *extremely* small fraction of
> users, which doesn't make it better to allow things to break for them, but
> at least gives us an idea of the scale of impact.
>
> (One other alternative suggested earlier was encoding metric names to
> account for differences; given the metric renaming mess in the last
> release, I'm extremely hesitant to suggest anything of the sort...)
>
> None of the options are ideal, but to me, 3 seems like the least painful.
> Both for us, and for the vast majority of users. It seems to me that the
> number of users that would complain about (1) or (2) drastically outweigh
> (3).
>
> At this point, I don't think it's practical to keep switching the rules
> about which characters are allowed and which aren't because the previous
> attempts haven't been successful -- it seems the rules have changed
> multiple times, whether intentionally or accidentally, such that any more
> changes will cause problems. At this point, I think we just need to accept
> being liberal in accepting the range of topic names that have been
> permitted so far and make the best of the situation, even if it means only
> being able to warn people of conflicts.
>
> Here's another alternative: how about being liberal with topic name
> characters, but upon topic creation we convert the name to the metric name
> and fail if there's a conflict with another topic? This is relatively
> expensive (requires getting the metric name of all other topics), but it
> avoids the bad situation we're encountering here (conflicting metrics),
> avoids getting into a persistent conflict (we kill topic creation when we
> detect the issue rather than noticing it when the metrics conflict
> happens), and keeps the vast majority of existing users happy (both _ and .
> work in topic names as long as you don't create topics with conflicting
> metric names).
>
> There are definitely details to be worked out (auto topic creation?), but
> it seems like a more realistic solution than to start disallowing _ or . in
> topic names.

I was thinking the same. Allow a.b or a_b but not a.b and a_b. This
seems like it will impact a trivial amount of users and keep both the
"." and "_" camps happy.

>
> -Ewen
>
>
>>
>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
>> <ew...@confluent.io> wrote:
>> > I figure you'll probably see complaints no matter what change you make.
>> > Gwen, given that you raised this, another important question might be how
>> > many people you see using *both*. I'm guessing this question came up
>> > because you actually saw a conflict? But I'd imagine (or at least hope)
>> > that most organizations are mostly consistent about naming topics -- they
>> > standardize on one or the other.
>> >
>> > Since there's no "right" way to name them, I'd just leave it supporting
>> > both and document the potential conflict in metrics. And if people use
>> both
>> > naming schemes, they probably deserve to suffer for their inconsistency
>> :)
>> >
>> > -Ewen
>> >
>> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
>> wrote:
>> >
>> >> I find dots more common in my customer base, so I will definitely feel
>> >> the pain of removing them.
>> >>
>> >> However, "." are already used in metrics, file names, directories, etc
>> >> - so if we keep the dots, we need to keep code that translates them
>> >> and document the translation. Just banning "." seems more natural.
>> >> Also, as Grant mentioned, we'll probably have our own special usage
>> >> for "." down the line.
>> >>
>> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
>> >> > I absolutely disagree with #2, Neha. That will break a lot of
>> >> > infrastructure within LinkedIn. That said, removing "." might break
>> other
>> >> > people as well, but I think we should have a clearer idea of how much
>> >> usage
>> >> > there is on either side.
>> >> >
>> >> > -Todd
>> >> >
>> >> >
>> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
>> >> wrote:
>> >> >
>> >> >> "." seems natural for grouping topic names. +1 for 2) going forward
>> only
>> >> >> without breaking previously created topics with "_" though that might
>> >> >> require us to patch the code somewhat awkwardly till we phase it out
>> a
>> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
>> versions
>> >> >> later.
>> >> >>
>> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gshapira@cloudera.com
>> >
>> >> >> wrote:
>> >> >>
>> >> >> > I don't think we should break existing topics. Just disallow new
>> >> >> > topics going forward.
>> >> >> >
>> >> >> > Agree that having both is horrible, but we should have a solution
>> that
>> >> >> > fails when you run "kafka_topics.sh --create", not when you
>> configure
>> >> >> > Ganglia.
>> >> >> >
>> >> >> > Gwen
>> >> >> >
>> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
>> wrote:
>> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
>> perverse,
>> >> >> but
>> >> >> > > people seem to do it. Breaking all the topics with '.' in the
>> name
>> >> >> seems
>> >> >> > > like it could be worse than combining metrics for people who
>> have a
>> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
>> >> perverse,
>> >> >> > > no?).
>> >> >> > >
>> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
>> >> >> > >
>> >> >> > > -Jay
>> >> >> > >
>> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>> >> >> wrote:
>> >> >> > >
>> >> >> > >> My selfish point of view is that we do #1, as we use "_"
>> >> extensively
>> >> >> in
>> >> >> > >> topic names here :) I also happen to think it's the right
>> choice,
>> >> >> > >> specifically because "." has more special meanings, as you
>> noted.
>> >> >> > >>
>> >> >> > >> -Todd
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
>> >> gshapira@cloudera.com>
>> >> >> > >> wrote:
>> >> >> > >>
>> >> >> > >> > Unintentional side effect from allowing IP addresses in
>> consumer
>> >> >> > client
>> >> >> > >> > IDs :)
>> >> >> > >> >
>> >> >> > >> > So the question is, what do we do now?
>> >> >> > >> >
>> >> >> > >> > 1) disallow "."
>> >> >> > >> > 2) disallow "_"
>> >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
>> break
>> >> >> > existing
>> >> >> > >> > metrics
>> >> >> > >> > 4) all of the above?
>> >> >> > >> >
>> >> >> > >> > btw. it looks like "." and ".." are currently valid. Topic
>> names
>> >> are
>> >> >> > >> > used for directories, right? this sounds like fun :)
>> >> >> > >> >
>> >> >> > >> > I vote for option #1, although if someone has a good idea for
>> #3
>> >> it
>> >> >> > >> > will be even better.
>> >> >> > >> >
>> >> >> > >> > Gwen
>> >> >> > >> >
>> >> >> > >> >
>> >> >> > >> >
>> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
>> >> ghenke@cloudera.com>
>> >> >> > >> wrote:
>> >> >> > >> > > Found it was added here:
>> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
>> >> >> > >> > >
>> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
>> >> tpalino@gmail.com>
>> >> >> > >> wrote:
>> >> >> > >> > >
>> >> >> > >> > >> This was definitely changed at some point after KAFKA-495.
>> The
>> >> >> > >> question
>> >> >> > >> > is
>> >> >> > >> > >> when and why.
>> >> >> > >> > >>
>> >> >> > >> > >> Here's the relevant code from that patch:
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> ===================================================================
>> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>> >> >> 1390178)
>> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working
>> copy)
>> >> >> > >> > >> @@ -21,24 +21,21 @@
>> >> >> > >> > >>  import util.matching.Regex
>> >> >> > >> > >>
>> >> >> > >> > >>  object Topic {
>> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >> -Todd
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>> >> >> ghenke@cloudera.com>
>> >> >> > >> > wrote:
>> >> >> > >> > >>
>> >> >> > >> > >> > kafka.common.Topic shows that currently period is a valid
>> >> >> > character
>> >> >> > >> > and I
>> >> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
>> >> topic
>> >> >> > with a
>> >> >> > >> > >> > period.
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>> >> >> > currently
>> >> >> > >> > uses
>> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
>> >> >> > >> > >> >
>> >> >> > >> > >> > Should period character support be removed? I was under
>> the
>> >> >> same
>> >> >> > >> > >> impression
>> >> >> > >> > >> > as Gwen, that a period was used by many as a way to
>> "group"
>> >> >> > topics.
>> >> >> > >> > >> >
>> >> >> > >> > >> > The code is pasted below since its small:
>> >> >> > >> > >> >
>> >> >> > >> > >> > object Topic {
>> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> >> >> > >> > >> >   private val maxNameLength = 255
>> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
>> >> >> > >> > >> >
>> >> >> > >> > >> >   val InternalTopics =
>> Set(OffsetManager.OffsetsTopicName)
>> >> >> > >> > >> >
>> >> >> > >> > >> >   def validate(topic: String) {
>> >> >> > >> > >> >     if (topic.length <= 0)
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> >> illegal,
>> >> >> > can't
>> >> >> > >> be
>> >> >> > >> > >> > empty")
>> >> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name cannot
>> be
>> >> >> > \".\" or
>> >> >> > >> > >> > \"..\"")
>> >> >> > >> > >> >     else if (topic.length > maxNameLength)
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> >> illegal,
>> >> >> > can't
>> >> >> > >> be
>> >> >> > >> > >> > longer than " + maxNameLength + " characters")
>> >> >> > >> > >> >
>> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
>> >> >> > >> > >> >       case Some(t) =>
>> >> >> > >> > >> >         if (!t.equals(topic))
>> >> >> > >> > >> >           throw new InvalidTopicException("topic name " +
>> >> topic
>> >> >> > + "
>> >> >> > >> is
>> >> >> > >> > >> > illegal, contains a character other than ASCII
>> >> alphanumerics,
>> >> >> > '.',
>> >> >> > >> '_'
>> >> >> > >> > >> and
>> >> >> > >> > >> > '-'")
>> >> >> > >> > >> >       case None => throw new InvalidTopicException("topic
>> >> name
>> >> >> "
>> >> >> > +
>> >> >> > >> > topic
>> >> >> > >> > >> +
>> >> >> > >> > >> > " is illegal,  contains a character other than ASCII
>> >> >> > alphanumerics,
>> >> >> > >> > '.',
>> >> >> > >> > >> > '_' and '-'")
>> >> >> > >> > >> >     }
>> >> >> > >> > >> >   }
>> >> >> > >> > >> > }
>> >> >> > >> > >> >
>> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>> >> >> tpalino@gmail.com>
>> >> >> > >> > wrote:
>> >> >> > >> > >> >
>> >> >> > >> > >> > > I had to go look this one up again to make sure -
>> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > The only valid character names for topics are
>> >> alphanumeric,
>> >> >> > >> > underscore,
>> >> >> > >> > >> > and
>> >> >> > >> > >> > > dash. A period is not supposed to be a valid character
>> to
>> >> >> use.
>> >> >> > If
>> >> >> > >> > >> you're
>> >> >> > >> > >> > > seeing them, then one of two things have happened:
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > 1) You have topic names that are grandfathered in from
>> >> before
>> >> >> > that
>> >> >> > >> > >> patch
>> >> >> > >> > >> > > 2) The patch is not working properly and there is
>> >> somewhere
>> >> >> in
>> >> >> > the
>> >> >> > >> > >> broker
>> >> >> > >> > >> > > that the standard is not being enforced.
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > -Todd
>> >> >> > >> > >> > >
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>> >> >> > brock@apache.org>
>> >> >> > >> > >> wrote:
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> >> >> > >> > >> gshapira@cloudera.com>
>> >> >> > >> > >> > > > wrote:
>> >> >> > >> > >> > > > > Hi Kafka Fans,
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
>> >> other
>> >> >> > named
>> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
>> named
>> >> >> > >> kafka_lab_2
>> >> >> > >> > >> for
>> >> >> > >> > >> > > > > both, effectively making it impossible to monitor
>> them
>> >> >> > >> properly.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > The reason this happens is that using "." in topic
>> >> names
>> >> >> is
>> >> >> > >> > pretty
>> >> >> > >> > >> > > > > common, especially as a way to group topics into
>> data
>> >> >> > centers,
>> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around to our
>> >> >> current
>> >> >> > >> > lack of
>> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
>> systems
>> >> >> using
>> >> >> > "."
>> >> >> > >> > to
>> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
>> metric
>> >> >> names,
>> >> >> > >> > Kafka
>> >> >> > >> > >> > > > > replaces the "." in the name with an underscore.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > This generates good metric names, but creates the
>> >> problem
>> >> >> > with
>> >> >> > >> > name
>> >> >> > >> > >> > > > collisions.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit the
>> >> range
>> >> >> > of
>> >> >> > >> > >> > > > > characters permitted in a topic name and disallow
>> "_"?
>> >> >> > >> Obviously
>> >> >> > >> > >> > > > > existing topics will need to remain as is, which
>> is a
>> >> bit
>> >> >> > >> > awkward.
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > > > Interesting problem! Many if not most users I
>> >> personally am
>> >> >> > >> aware
>> >> >> > >> > of
>> >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure that
>> >> many
>> >> >> > users
>> >> >> > >> > >> would
>> >> >> > >> > >> > > > be quite surprised by this limitation. With that
>> said,
>> >> I am
>> >> >> > sure
>> >> >> > >> > >> > > > they'd transition accordingly.
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > If anyone has better backward-compatible solutions
>> to
>> >> >> this,
>> >> >> > >> I'm
>> >> >> > >> > all
>> >> >> > >> > >> > > ears
>> >> >> > >> > >> > > > :)
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > Gwen
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > >
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> > --
>> >> >> > >> > >> > Grant Henke
>> >> >> > >> > >> > Solutions Consultant | Cloudera
>> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
>> >> >> > >> > linkedin.com/in/granthenke
>> >> >> > >> > >> >
>> >> >> > >> > >>
>> >> >> > >> > >
>> >> >> > >> > >
>> >> >> > >> > >
>> >> >> > >> > > --
>> >> >> > >> > > Grant Henke
>> >> >> > >> > > Solutions Consultant | Cloudera
>> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
>> >> >> > linkedin.com/in/granthenke
>> >> >> > >> >
>> >> >> > >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks,
>> >> >> Neha
>> >> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Ewen
>>
>
>
>
> --
> Thanks,
> Ewen

Re: [Discussion] Limitations on topic names

Posted by Jun Rao <ju...@confluent.io>.
Magnus,

Converting dot to _ essentially is our way of escaping in the scope part of
the metric name. The issue is that your options of escaping is limited due
to the constraints in the reporters. For example, the Ganglia reporter
replaces anything other than alpha-numeric, -, _ and dot to _ in the metric
name. Not sure how well Graphite deals with \ either. For details, take a
look at the discussion in KAFKA-1902. Note that the replacement of dots
only affects the reporters. Dots are preserved in the mbean names.

Thanks,

Jun

On Sun, Jul 12, 2015 at 10:58 PM, Magnus Edenhill <ma...@edenhill.se>
wrote:

> Hi,
>
> since dots seem to be a problem on the metrics side, why not let the
> metrics side handle it
> by escaping troublesome characters? E.g. "foo.my\.topic.feh"
> Let's not push the problem upstream.
>
> Replacing "." with another set of allowed characters "__" seems like a bad
> idea since it
> is ambigious: "__consumer_offsets" == ".consumer_offsets"?
>
> I'm guessing the same problem arises if broker names are part of the
> metrics name,
> e.g., "broker.192.168.0.2.rxbytes", do we want to push the exclusion of
> dots in IP addresses
> upstream as well? :)
>
> Magnus
>
>
> 2015-07-13 2:06 GMT+02:00 Jun Rao <ju...@confluent.io>:
>
> > First, a couple of clarifications on this.
> >
> > 1. Currently, we allow Kafka topic to have dots, except that we disallow
> > topic names that are exactly "." or ".." (which can cause weird problems
> > when mapping to file directories and ZK paths as Gwen pointed out).
> >
> > 2. When creating the Coda Hale metrics, currently, we only replace dot
> with
> > _ in the scope of the metric name. The actually jmx bean name still
> > preserves dot. This is because the Graphite reporter uses scope when
> > forming the metric names and assumes dots are component separators (see
> > KAFKA-1902 for details). So, if one uses tools like jmxtrans to export
> the
> > metrics from the mbeans directly, the original topic name is preserved.
> > However, I am not sure how well this maps to Graphite. We thought about
> > making the replacing character configurable. However, the difficulty is
> > that the logic of doing the replacement is in a singleton
> > class KafkaMetricsGroup and I am not sure if we can pass in an external
> > config.
> >
> > Given the above, I'd suggest that customer try the jmxtrans to Graphite
> > path and see if that helps. I agree that it's too disruptive to restrict
> > the current topic naming convention.
> >
> > Also, since we plan to replace Coda Hale metrics with Kafka metrics in
> the
> > future, we can try to address this issue better then.
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> >
> > On Sun, Jul 12, 2015 at 10:26 AM, Gwen Shapira <gs...@cloudera.com>
> > wrote:
> >
> > > I like the "lets warn people of conflicts when creating the topic"
> > > suggestion. IMO, automatic topic creation as currently done is buggy
> > > either way (Send data and hope the topic is ready before retries run
> > > out, potentially failing with the super helpful NO_LEADER error), so I
> > > don't mind leaving it broken a bit more. I think the right behavior is
> > > that conflicts will cause auto creating to fail, the same way we
> > > currently do when the default number of replicas is higher than number
> > > of brokers.
> > >
> > > One thing that is left confusing is that people in the "." camp need
> > > to know about the conversion or they will fail to find their topics in
> > > their monitoring tools. Not very nice to them, but I can't think of
> > > alternatives.
> > >
> > > I'll start with the doc patch :)
> > >
> > > On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> > > <ew...@confluent.io> wrote:
> > > > On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gshapira@cloudera.com
> >
> > > wrote:
> > > >
> > > >> Yeah, I have an actual customer who ran into this. Unfortunately,
> > > >> inconsistencies in the way things are named are pretty common - just
> > > >> look at Kafka's many CLI options.
> > > >>
> > > >> I don't think that supporting both and pointing at the docs with "I
> > > >> told you so" when our metrics break is a good solution.
> > > >>
> > > >
> > > > I agree, especially since we don't *already* have something in the
> docs
> > > > indicating this will be an issue. I was flippant about the situation
> > > > because I *wish* there was more careful consideration + naming policy
> > in
> > > > place, but I realize that doesn't always happen in practice. I guess
> I
> > > need
> > > > to take Compatibility Czar more seriously :)
> > > >
> > > > I see think the obvious practical options are as follows:
> > > >
> > > > 1. Kill support for "_". Piss off the entire set of people who
> > currently
> > > > use "_" anywhere in topic names.
> > > > 2. Kill support for ".". Piss off the entire set of people who
> > currently
> > > > use "." anywhere in topic names.
> > > > 3. Tell people they need to be careful about this issue. Piss off the
> > set
> > > > of people who use both "_" and "." *and* happen to have conflicting
> > topic
> > > > names. They will have some pain when they discover the issue and have
> > to
> > > > figure out how to move one of those topics over to a non-conflicting
> > > name.
> > > > I'm going to claim that this group must be an *extremely* small
> > fraction
> > > of
> > > > users, which doesn't make it better to allow things to break for
> them,
> > > but
> > > > at least gives us an idea of the scale of impact.
> > > >
> > > > (One other alternative suggested earlier was encoding metric names to
> > > > account for differences; given the metric renaming mess in the last
> > > > release, I'm extremely hesitant to suggest anything of the sort...)
> > > >
> > > > None of the options are ideal, but to me, 3 seems like the least
> > painful.
> > > > Both for us, and for the vast majority of users. It seems to me that
> > the
> > > > number of users that would complain about (1) or (2) drastically
> > outweigh
> > > > (3).
> > > >
> > > > At this point, I don't think it's practical to keep switching the
> rules
> > > > about which characters are allowed and which aren't because the
> > previous
> > > > attempts haven't been successful -- it seems the rules have changed
> > > > multiple times, whether intentionally or accidentally, such that any
> > more
> > > > changes will cause problems. At this point, I think we just need to
> > > accept
> > > > being liberal in accepting the range of topic names that have been
> > > > permitted so far and make the best of the situation, even if it means
> > > only
> > > > being able to warn people of conflicts.
> > > >
> > > > Here's another alternative: how about being liberal with topic name
> > > > characters, but upon topic creation we convert the name to the metric
> > > name
> > > > and fail if there's a conflict with another topic? This is relatively
> > > > expensive (requires getting the metric name of all other topics), but
> > it
> > > > avoids the bad situation we're encountering here (conflicting
> metrics),
> > > > avoids getting into a persistent conflict (we kill topic creation
> when
> > we
> > > > detect the issue rather than noticing it when the metrics conflict
> > > > happens), and keeps the vast majority of existing users happy (both _
> > > and .
> > > > work in topic names as long as you don't create topics with
> conflicting
> > > > metric names).
> > > >
> > > > There are definitely details to be worked out (auto topic creation?),
> > but
> > > > it seems like a more realistic solution than to start disallowing _
> or
> > .
> > > in
> > > > topic names.
> > > >
> > > > -Ewen
> > > >
> > > >
> > > >>
> > > >> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> > > >> <ew...@confluent.io> wrote:
> > > >> > I figure you'll probably see complaints no matter what change you
> > > make.
> > > >> > Gwen, given that you raised this, another important question might
> > be
> > > how
> > > >> > many people you see using *both*. I'm guessing this question came
> up
> > > >> > because you actually saw a conflict? But I'd imagine (or at least
> > > hope)
> > > >> > that most organizations are mostly consistent about naming topics
> --
> > > they
> > > >> > standardize on one or the other.
> > > >> >
> > > >> > Since there's no "right" way to name them, I'd just leave it
> > > supporting
> > > >> > both and document the potential conflict in metrics. And if people
> > use
> > > >> both
> > > >> > naming schemes, they probably deserve to suffer for their
> > > inconsistency
> > > >> :)
> > > >> >
> > > >> > -Ewen
> > > >> >
> > > >> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <
> > gshapira@cloudera.com>
> > > >> wrote:
> > > >> >
> > > >> >> I find dots more common in my customer base, so I will definitely
> > > feel
> > > >> >> the pain of removing them.
> > > >> >>
> > > >> >> However, "." are already used in metrics, file names,
> directories,
> > > etc
> > > >> >> - so if we keep the dots, we need to keep code that translates
> them
> > > >> >> and document the translation. Just banning "." seems more
> natural.
> > > >> >> Also, as Grant mentioned, we'll probably have our own special
> usage
> > > >> >> for "." down the line.
> > > >> >>
> > > >> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> > > wrote:
> > > >> >> > I absolutely disagree with #2, Neha. That will break a lot of
> > > >> >> > infrastructure within LinkedIn. That said, removing "." might
> > break
> > > >> other
> > > >> >> > people as well, but I think we should have a clearer idea of
> how
> > > much
> > > >> >> usage
> > > >> >> > there is on either side.
> > > >> >> >
> > > >> >> > -Todd
> > > >> >> >
> > > >> >> >
> > > >> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <
> > neha@confluent.io>
> > > >> >> wrote:
> > > >> >> >
> > > >> >> >> "." seems natural for grouping topic names. +1 for 2) going
> > > forward
> > > >> only
> > > >> >> >> without breaking previously created topics with "_" though
> that
> > > might
> > > >> >> >> require us to patch the code somewhat awkwardly till we phase
> it
> > > out
> > > >> a
> > > >> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> > > >> versions
> > > >> >> >> later.
> > > >> >> >>
> > > >> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> > > gshapira@cloudera.com
> > > >> >
> > > >> >> >> wrote:
> > > >> >> >>
> > > >> >> >> > I don't think we should break existing topics. Just disallow
> > new
> > > >> >> >> > topics going forward.
> > > >> >> >> >
> > > >> >> >> > Agree that having both is horrible, but we should have a
> > > solution
> > > >> that
> > > >> >> >> > fails when you run "kafka_topics.sh --create", not when you
> > > >> configure
> > > >> >> >> > Ganglia.
> > > >> >> >> >
> > > >> >> >> > Gwen
> > > >> >> >> >
> > > >> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <
> jay@confluent.io>
> > > >> wrote:
> > > >> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> > > >> perverse,
> > > >> >> >> but
> > > >> >> >> > > people seem to do it. Breaking all the topics with '.' in
> > the
> > > >> name
> > > >> >> >> seems
> > > >> >> >> > > like it could be worse than combining metrics for people
> who
> > > >> have a
> > > >> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is
> > DEEPLY
> > > >> >> perverse,
> > > >> >> >> > > no?).
> > > >> >> >> > >
> > > >> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> > > >> >> >> > >
> > > >> >> >> > > -Jay
> > > >> >> >> > >
> > > >> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <
> > > tpalino@gmail.com>
> > > >> >> >> wrote:
> > > >> >> >> > >
> > > >> >> >> > >> My selfish point of view is that we do #1, as we use "_"
> > > >> >> extensively
> > > >> >> >> in
> > > >> >> >> > >> topic names here :) I also happen to think it's the right
> > > >> choice,
> > > >> >> >> > >> specifically because "." has more special meanings, as
> you
> > > >> noted.
> > > >> >> >> > >>
> > > >> >> >> > >> -Todd
> > > >> >> >> > >>
> > > >> >> >> > >>
> > > >> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> > > >> >> gshapira@cloudera.com>
> > > >> >> >> > >> wrote:
> > > >> >> >> > >>
> > > >> >> >> > >> > Unintentional side effect from allowing IP addresses in
> > > >> consumer
> > > >> >> >> > client
> > > >> >> >> > >> > IDs :)
> > > >> >> >> > >> >
> > > >> >> >> > >> > So the question is, what do we do now?
> > > >> >> >> > >> >
> > > >> >> >> > >> > 1) disallow "."
> > > >> >> >> > >> > 2) disallow "_"
> > > >> >> >> > >> > 3) find a reversible way to encode "." and "_" that
> won't
> > > >> break
> > > >> >> >> > existing
> > > >> >> >> > >> > metrics
> > > >> >> >> > >> > 4) all of the above?
> > > >> >> >> > >> >
> > > >> >> >> > >> > btw. it looks like "." and ".." are currently valid.
> > Topic
> > > >> names
> > > >> >> are
> > > >> >> >> > >> > used for directories, right? this sounds like fun :)
> > > >> >> >> > >> >
> > > >> >> >> > >> > I vote for option #1, although if someone has a good
> idea
> > > for
> > > >> #3
> > > >> >> it
> > > >> >> >> > >> > will be even better.
> > > >> >> >> > >> >
> > > >> >> >> > >> > Gwen
> > > >> >> >> > >> >
> > > >> >> >> > >> >
> > > >> >> >> > >> >
> > > >> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> > > >> >> ghenke@cloudera.com>
> > > >> >> >> > >> wrote:
> > > >> >> >> > >> > > Found it was added here:
> > > >> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > > >> >> >> > >> > >
> > > >> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> > > >> >> tpalino@gmail.com>
> > > >> >> >> > >> wrote:
> > > >> >> >> > >> > >
> > > >> >> >> > >> > >> This was definitely changed at some point after
> > > KAFKA-495.
> > > >> The
> > > >> >> >> > >> question
> > > >> >> >> > >> > is
> > > >> >> >> > >> > >> when and why.
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >> Here's the relevant code from that patch:
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >>
> > > >> >> >>
> > > ===================================================================
> > > >> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala
> > > (revision
> > > >> >> >> 1390178)
> > > >> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala
> > (working
> > > >> copy)
> > > >> >> >> > >> > >> @@ -21,24 +21,21 @@
> > > >> >> >> > >> > >>  import util.matching.Regex
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >>  object Topic {
> > > >> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >> -Todd
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> > > >> >> >> ghenke@cloudera.com>
> > > >> >> >> > >> > wrote:
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >> > kafka.common.Topic shows that currently period is
> a
> > > valid
> > > >> >> >> > character
> > > >> >> >> > >> > and I
> > > >> >> >> > >> > >> > have verified I can use kafka-topics.sh to create
> a
> > > new
> > > >> >> topic
> > > >> >> >> > with a
> > > >> >> >> > >> > >> > period.
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >
> > > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > > >> >> >> > currently
> > > >> >> >> > >> > uses
> > > >> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > Should period character support be removed? I was
> > > under
> > > >> the
> > > >> >> >> same
> > > >> >> >> > >> > >> impression
> > > >> >> >> > >> > >> > as Gwen, that a period was used by many as a way
> to
> > > >> "group"
> > > >> >> >> > topics.
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > The code is pasted below since its small:
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > object Topic {
> > > >> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > > >> >> >> > >> > >> >   private val maxNameLength = 255
> > > >> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >   val InternalTopics =
> > > >> Set(OffsetManager.OffsetsTopicName)
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >   def validate(topic: String) {
> > > >> >> >> > >> > >> >     if (topic.length <= 0)
> > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> is
> > > >> >> illegal,
> > > >> >> >> > can't
> > > >> >> >> > >> be
> > > >> >> >> > >> > >> > empty")
> > > >> >> >> > >> > >> >     else if (topic.equals(".") ||
> > topic.equals(".."))
> > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> > > cannot
> > > >> be
> > > >> >> >> > \".\" or
> > > >> >> >> > >> > >> > \"..\"")
> > > >> >> >> > >> > >> >     else if (topic.length > maxNameLength)
> > > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> is
> > > >> >> illegal,
> > > >> >> >> > can't
> > > >> >> >> > >> be
> > > >> >> >> > >> > >> > longer than " + maxNameLength + " characters")
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> > > >> >> >> > >> > >> >       case Some(t) =>
> > > >> >> >> > >> > >> >         if (!t.equals(topic))
> > > >> >> >> > >> > >> >           throw new InvalidTopicException("topic
> > name
> > > " +
> > > >> >> topic
> > > >> >> >> > + "
> > > >> >> >> > >> is
> > > >> >> >> > >> > >> > illegal, contains a character other than ASCII
> > > >> >> alphanumerics,
> > > >> >> >> > '.',
> > > >> >> >> > >> '_'
> > > >> >> >> > >> > >> and
> > > >> >> >> > >> > >> > '-'")
> > > >> >> >> > >> > >> >       case None => throw new
> > > InvalidTopicException("topic
> > > >> >> name
> > > >> >> >> "
> > > >> >> >> > +
> > > >> >> >> > >> > topic
> > > >> >> >> > >> > >> +
> > > >> >> >> > >> > >> > " is illegal,  contains a character other than
> ASCII
> > > >> >> >> > alphanumerics,
> > > >> >> >> > >> > '.',
> > > >> >> >> > >> > >> > '_' and '-'")
> > > >> >> >> > >> > >> >     }
> > > >> >> >> > >> > >> >   }
> > > >> >> >> > >> > >> > }
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> > > >> >> >> tpalino@gmail.com>
> > > >> >> >> > >> > wrote:
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > > I had to go look this one up again to make sure
> -
> > > >> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > > The only valid character names for topics are
> > > >> >> alphanumeric,
> > > >> >> >> > >> > underscore,
> > > >> >> >> > >> > >> > and
> > > >> >> >> > >> > >> > > dash. A period is not supposed to be a valid
> > > character
> > > >> to
> > > >> >> >> use.
> > > >> >> >> > If
> > > >> >> >> > >> > >> you're
> > > >> >> >> > >> > >> > > seeing them, then one of two things have
> happened:
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > > 1) You have topic names that are grandfathered
> in
> > > from
> > > >> >> before
> > > >> >> >> > that
> > > >> >> >> > >> > >> patch
> > > >> >> >> > >> > >> > > 2) The patch is not working properly and there
> is
> > > >> >> somewhere
> > > >> >> >> in
> > > >> >> >> > the
> > > >> >> >> > >> > >> broker
> > > >> >> >> > >> > >> > > that the standard is not being enforced.
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > > -Todd
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > > >> >> >> > brock@apache.org>
> > > >> >> >> > >> > >> wrote:
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen
> Shapira <
> > > >> >> >> > >> > >> gshapira@cloudera.com>
> > > >> >> >> > >> > >> > > > wrote:
> > > >> >> >> > >> > >> > > > > Hi Kafka Fans,
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2"
> and
> > > the
> > > >> >> other
> > > >> >> >> > named
> > > >> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will
> be
> > > >> named
> > > >> >> >> > >> kafka_lab_2
> > > >> >> >> > >> > >> for
> > > >> >> >> > >> > >> > > > > both, effectively making it impossible to
> > > monitor
> > > >> them
> > > >> >> >> > >> properly.
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > The reason this happens is that using "." in
> > > topic
> > > >> >> names
> > > >> >> >> is
> > > >> >> >> > >> > pretty
> > > >> >> >> > >> > >> > > > > common, especially as a way to group topics
> > into
> > > >> data
> > > >> >> >> > centers,
> > > >> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around
> > to
> > > our
> > > >> >> >> current
> > > >> >> >> > >> > lack of
> > > >> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> > > >> systems
> > > >> >> >> using
> > > >> >> >> > "."
> > > >> >> >> > >> > to
> > > >> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues
> around
> > > >> metric
> > > >> >> >> names,
> > > >> >> >> > >> > Kafka
> > > >> >> >> > >> > >> > > > > replaces the "." in the name with an
> > underscore.
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > This generates good metric names, but
> creates
> > > the
> > > >> >> problem
> > > >> >> >> > with
> > > >> >> >> > >> > name
> > > >> >> >> > >> > >> > > > collisions.
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply
> > limit
> > > the
> > > >> >> range
> > > >> >> >> > of
> > > >> >> >> > >> > >> > > > > characters permitted in a topic name and
> > > disallow
> > > >> "_"?
> > > >> >> >> > >> Obviously
> > > >> >> >> > >> > >> > > > > existing topics will need to remain as is,
> > which
> > > >> is a
> > > >> >> bit
> > > >> >> >> > >> > awkward.
> > > >> >> >> > >> > >> > > >
> > > >> >> >> > >> > >> > > > Interesting problem! Many if not most users I
> > > >> >> personally am
> > > >> >> >> > >> aware
> > > >> >> >> > >> > of
> > > >> >> >> > >> > >> > > > use "_" as a separator in topic names. I am
> sure
> > > that
> > > >> >> many
> > > >> >> >> > users
> > > >> >> >> > >> > >> would
> > > >> >> >> > >> > >> > > > be quite surprised by this limitation. With
> that
> > > >> said,
> > > >> >> I am
> > > >> >> >> > sure
> > > >> >> >> > >> > >> > > > they'd transition accordingly.
> > > >> >> >> > >> > >> > > >
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > If anyone has better backward-compatible
> > > solutions
> > > >> to
> > > >> >> >> this,
> > > >> >> >> > >> I'm
> > > >> >> >> > >> > all
> > > >> >> >> > >> > >> > > ears
> > > >> >> >> > >> > >> > > > :)
> > > >> >> >> > >> > >> > > > >
> > > >> >> >> > >> > >> > > > > Gwen
> > > >> >> >> > >> > >> > > >
> > > >> >> >> > >> > >> > >
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >> > --
> > > >> >> >> > >> > >> > Grant Henke
> > > >> >> >> > >> > >> > Solutions Consultant | Cloudera
> > > >> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > > >> >> >> > >> > linkedin.com/in/granthenke
> > > >> >> >> > >> > >> >
> > > >> >> >> > >> > >>
> > > >> >> >> > >> > >
> > > >> >> >> > >> > >
> > > >> >> >> > >> > >
> > > >> >> >> > >> > > --
> > > >> >> >> > >> > > Grant Henke
> > > >> >> >> > >> > > Solutions Consultant | Cloudera
> > > >> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > > >> >> >> > linkedin.com/in/granthenke
> > > >> >> >> > >> >
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> --
> > > >> >> >> Thanks,
> > > >> >> >> Neha
> > > >> >> >>
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Thanks,
> > > >> > Ewen
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> > >
> >
>

Re: [Discussion] Limitations on topic names

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hi,

since dots seem to be a problem on the metrics side, why not let the
metrics side handle it
by escaping troublesome characters? E.g. "foo.my\.topic.feh"
Let's not push the problem upstream.

Replacing "." with another set of allowed characters "__" seems like a bad
idea since it
is ambigious: "__consumer_offsets" == ".consumer_offsets"?

I'm guessing the same problem arises if broker names are part of the
metrics name,
e.g., "broker.192.168.0.2.rxbytes", do we want to push the exclusion of
dots in IP addresses
upstream as well? :)

Magnus


2015-07-13 2:06 GMT+02:00 Jun Rao <ju...@confluent.io>:

> First, a couple of clarifications on this.
>
> 1. Currently, we allow Kafka topic to have dots, except that we disallow
> topic names that are exactly "." or ".." (which can cause weird problems
> when mapping to file directories and ZK paths as Gwen pointed out).
>
> 2. When creating the Coda Hale metrics, currently, we only replace dot with
> _ in the scope of the metric name. The actually jmx bean name still
> preserves dot. This is because the Graphite reporter uses scope when
> forming the metric names and assumes dots are component separators (see
> KAFKA-1902 for details). So, if one uses tools like jmxtrans to export the
> metrics from the mbeans directly, the original topic name is preserved.
> However, I am not sure how well this maps to Graphite. We thought about
> making the replacing character configurable. However, the difficulty is
> that the logic of doing the replacement is in a singleton
> class KafkaMetricsGroup and I am not sure if we can pass in an external
> config.
>
> Given the above, I'd suggest that customer try the jmxtrans to Graphite
> path and see if that helps. I agree that it's too disruptive to restrict
> the current topic naming convention.
>
> Also, since we plan to replace Coda Hale metrics with Kafka metrics in the
> future, we can try to address this issue better then.
>
> Thanks,
>
> Jun
>
>
>
>
> On Sun, Jul 12, 2015 at 10:26 AM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > I like the "lets warn people of conflicts when creating the topic"
> > suggestion. IMO, automatic topic creation as currently done is buggy
> > either way (Send data and hope the topic is ready before retries run
> > out, potentially failing with the super helpful NO_LEADER error), so I
> > don't mind leaving it broken a bit more. I think the right behavior is
> > that conflicts will cause auto creating to fail, the same way we
> > currently do when the default number of replicas is higher than number
> > of brokers.
> >
> > One thing that is left confusing is that people in the "." camp need
> > to know about the conversion or they will fail to find their topics in
> > their monitoring tools. Not very nice to them, but I can't think of
> > alternatives.
> >
> > I'll start with the doc patch :)
> >
> > On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> > <ew...@confluent.io> wrote:
> > > On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com>
> > wrote:
> > >
> > >> Yeah, I have an actual customer who ran into this. Unfortunately,
> > >> inconsistencies in the way things are named are pretty common - just
> > >> look at Kafka's many CLI options.
> > >>
> > >> I don't think that supporting both and pointing at the docs with "I
> > >> told you so" when our metrics break is a good solution.
> > >>
> > >
> > > I agree, especially since we don't *already* have something in the docs
> > > indicating this will be an issue. I was flippant about the situation
> > > because I *wish* there was more careful consideration + naming policy
> in
> > > place, but I realize that doesn't always happen in practice. I guess I
> > need
> > > to take Compatibility Czar more seriously :)
> > >
> > > I see think the obvious practical options are as follows:
> > >
> > > 1. Kill support for "_". Piss off the entire set of people who
> currently
> > > use "_" anywhere in topic names.
> > > 2. Kill support for ".". Piss off the entire set of people who
> currently
> > > use "." anywhere in topic names.
> > > 3. Tell people they need to be careful about this issue. Piss off the
> set
> > > of people who use both "_" and "." *and* happen to have conflicting
> topic
> > > names. They will have some pain when they discover the issue and have
> to
> > > figure out how to move one of those topics over to a non-conflicting
> > name.
> > > I'm going to claim that this group must be an *extremely* small
> fraction
> > of
> > > users, which doesn't make it better to allow things to break for them,
> > but
> > > at least gives us an idea of the scale of impact.
> > >
> > > (One other alternative suggested earlier was encoding metric names to
> > > account for differences; given the metric renaming mess in the last
> > > release, I'm extremely hesitant to suggest anything of the sort...)
> > >
> > > None of the options are ideal, but to me, 3 seems like the least
> painful.
> > > Both for us, and for the vast majority of users. It seems to me that
> the
> > > number of users that would complain about (1) or (2) drastically
> outweigh
> > > (3).
> > >
> > > At this point, I don't think it's practical to keep switching the rules
> > > about which characters are allowed and which aren't because the
> previous
> > > attempts haven't been successful -- it seems the rules have changed
> > > multiple times, whether intentionally or accidentally, such that any
> more
> > > changes will cause problems. At this point, I think we just need to
> > accept
> > > being liberal in accepting the range of topic names that have been
> > > permitted so far and make the best of the situation, even if it means
> > only
> > > being able to warn people of conflicts.
> > >
> > > Here's another alternative: how about being liberal with topic name
> > > characters, but upon topic creation we convert the name to the metric
> > name
> > > and fail if there's a conflict with another topic? This is relatively
> > > expensive (requires getting the metric name of all other topics), but
> it
> > > avoids the bad situation we're encountering here (conflicting metrics),
> > > avoids getting into a persistent conflict (we kill topic creation when
> we
> > > detect the issue rather than noticing it when the metrics conflict
> > > happens), and keeps the vast majority of existing users happy (both _
> > and .
> > > work in topic names as long as you don't create topics with conflicting
> > > metric names).
> > >
> > > There are definitely details to be worked out (auto topic creation?),
> but
> > > it seems like a more realistic solution than to start disallowing _ or
> .
> > in
> > > topic names.
> > >
> > > -Ewen
> > >
> > >
> > >>
> > >> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> > >> <ew...@confluent.io> wrote:
> > >> > I figure you'll probably see complaints no matter what change you
> > make.
> > >> > Gwen, given that you raised this, another important question might
> be
> > how
> > >> > many people you see using *both*. I'm guessing this question came up
> > >> > because you actually saw a conflict? But I'd imagine (or at least
> > hope)
> > >> > that most organizations are mostly consistent about naming topics --
> > they
> > >> > standardize on one or the other.
> > >> >
> > >> > Since there's no "right" way to name them, I'd just leave it
> > supporting
> > >> > both and document the potential conflict in metrics. And if people
> use
> > >> both
> > >> > naming schemes, they probably deserve to suffer for their
> > inconsistency
> > >> :)
> > >> >
> > >> > -Ewen
> > >> >
> > >> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <
> gshapira@cloudera.com>
> > >> wrote:
> > >> >
> > >> >> I find dots more common in my customer base, so I will definitely
> > feel
> > >> >> the pain of removing them.
> > >> >>
> > >> >> However, "." are already used in metrics, file names, directories,
> > etc
> > >> >> - so if we keep the dots, we need to keep code that translates them
> > >> >> and document the translation. Just banning "." seems more natural.
> > >> >> Also, as Grant mentioned, we'll probably have our own special usage
> > >> >> for "." down the line.
> > >> >>
> > >> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> > wrote:
> > >> >> > I absolutely disagree with #2, Neha. That will break a lot of
> > >> >> > infrastructure within LinkedIn. That said, removing "." might
> break
> > >> other
> > >> >> > people as well, but I think we should have a clearer idea of how
> > much
> > >> >> usage
> > >> >> > there is on either side.
> > >> >> >
> > >> >> > -Todd
> > >> >> >
> > >> >> >
> > >> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <
> neha@confluent.io>
> > >> >> wrote:
> > >> >> >
> > >> >> >> "." seems natural for grouping topic names. +1 for 2) going
> > forward
> > >> only
> > >> >> >> without breaking previously created topics with "_" though that
> > might
> > >> >> >> require us to patch the code somewhat awkwardly till we phase it
> > out
> > >> a
> > >> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> > >> versions
> > >> >> >> later.
> > >> >> >>
> > >> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> > gshapira@cloudera.com
> > >> >
> > >> >> >> wrote:
> > >> >> >>
> > >> >> >> > I don't think we should break existing topics. Just disallow
> new
> > >> >> >> > topics going forward.
> > >> >> >> >
> > >> >> >> > Agree that having both is horrible, but we should have a
> > solution
> > >> that
> > >> >> >> > fails when you run "kafka_topics.sh --create", not when you
> > >> configure
> > >> >> >> > Ganglia.
> > >> >> >> >
> > >> >> >> > Gwen
> > >> >> >> >
> > >> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> > >> wrote:
> > >> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> > >> perverse,
> > >> >> >> but
> > >> >> >> > > people seem to do it. Breaking all the topics with '.' in
> the
> > >> name
> > >> >> >> seems
> > >> >> >> > > like it could be worse than combining metrics for people who
> > >> have a
> > >> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is
> DEEPLY
> > >> >> perverse,
> > >> >> >> > > no?).
> > >> >> >> > >
> > >> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> > >> >> >> > >
> > >> >> >> > > -Jay
> > >> >> >> > >
> > >> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <
> > tpalino@gmail.com>
> > >> >> >> wrote:
> > >> >> >> > >
> > >> >> >> > >> My selfish point of view is that we do #1, as we use "_"
> > >> >> extensively
> > >> >> >> in
> > >> >> >> > >> topic names here :) I also happen to think it's the right
> > >> choice,
> > >> >> >> > >> specifically because "." has more special meanings, as you
> > >> noted.
> > >> >> >> > >>
> > >> >> >> > >> -Todd
> > >> >> >> > >>
> > >> >> >> > >>
> > >> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> > >> >> gshapira@cloudera.com>
> > >> >> >> > >> wrote:
> > >> >> >> > >>
> > >> >> >> > >> > Unintentional side effect from allowing IP addresses in
> > >> consumer
> > >> >> >> > client
> > >> >> >> > >> > IDs :)
> > >> >> >> > >> >
> > >> >> >> > >> > So the question is, what do we do now?
> > >> >> >> > >> >
> > >> >> >> > >> > 1) disallow "."
> > >> >> >> > >> > 2) disallow "_"
> > >> >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
> > >> break
> > >> >> >> > existing
> > >> >> >> > >> > metrics
> > >> >> >> > >> > 4) all of the above?
> > >> >> >> > >> >
> > >> >> >> > >> > btw. it looks like "." and ".." are currently valid.
> Topic
> > >> names
> > >> >> are
> > >> >> >> > >> > used for directories, right? this sounds like fun :)
> > >> >> >> > >> >
> > >> >> >> > >> > I vote for option #1, although if someone has a good idea
> > for
> > >> #3
> > >> >> it
> > >> >> >> > >> > will be even better.
> > >> >> >> > >> >
> > >> >> >> > >> > Gwen
> > >> >> >> > >> >
> > >> >> >> > >> >
> > >> >> >> > >> >
> > >> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> > >> >> ghenke@cloudera.com>
> > >> >> >> > >> wrote:
> > >> >> >> > >> > > Found it was added here:
> > >> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > >> >> >> > >> > >
> > >> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> > >> >> tpalino@gmail.com>
> > >> >> >> > >> wrote:
> > >> >> >> > >> > >
> > >> >> >> > >> > >> This was definitely changed at some point after
> > KAFKA-495.
> > >> The
> > >> >> >> > >> question
> > >> >> >> > >> > is
> > >> >> >> > >> > >> when and why.
> > >> >> >> > >> > >>
> > >> >> >> > >> > >> Here's the relevant code from that patch:
> > >> >> >> > >> > >>
> > >> >> >> > >> > >>
> > >> >> >>
> > ===================================================================
> > >> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala
> > (revision
> > >> >> >> 1390178)
> > >> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala
> (working
> > >> copy)
> > >> >> >> > >> > >> @@ -21,24 +21,21 @@
> > >> >> >> > >> > >>  import util.matching.Regex
> > >> >> >> > >> > >>
> > >> >> >> > >> > >>  object Topic {
> > >> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >> >> >> > >> > >>
> > >> >> >> > >> > >>
> > >> >> >> > >> > >>
> > >> >> >> > >> > >> -Todd
> > >> >> >> > >> > >>
> > >> >> >> > >> > >>
> > >> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> > >> >> >> ghenke@cloudera.com>
> > >> >> >> > >> > wrote:
> > >> >> >> > >> > >>
> > >> >> >> > >> > >> > kafka.common.Topic shows that currently period is a
> > valid
> > >> >> >> > character
> > >> >> >> > >> > and I
> > >> >> >> > >> > >> > have verified I can use kafka-topics.sh to create a
> > new
> > >> >> topic
> > >> >> >> > with a
> > >> >> >> > >> > >> > period.
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >
> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > >> >> >> > currently
> > >> >> >> > >> > uses
> > >> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > Should period character support be removed? I was
> > under
> > >> the
> > >> >> >> same
> > >> >> >> > >> > >> impression
> > >> >> >> > >> > >> > as Gwen, that a period was used by many as a way to
> > >> "group"
> > >> >> >> > topics.
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > The code is pasted below since its small:
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > object Topic {
> > >> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> >> >> > >> > >> >   private val maxNameLength = 255
> > >> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >   val InternalTopics =
> > >> Set(OffsetManager.OffsetsTopicName)
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >   def validate(topic: String) {
> > >> >> >> > >> > >> >     if (topic.length <= 0)
> > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> > >> >> illegal,
> > >> >> >> > can't
> > >> >> >> > >> be
> > >> >> >> > >> > >> > empty")
> > >> >> >> > >> > >> >     else if (topic.equals(".") ||
> topic.equals(".."))
> > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> > cannot
> > >> be
> > >> >> >> > \".\" or
> > >> >> >> > >> > >> > \"..\"")
> > >> >> >> > >> > >> >     else if (topic.length > maxNameLength)
> > >> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> > >> >> illegal,
> > >> >> >> > can't
> > >> >> >> > >> be
> > >> >> >> > >> > >> > longer than " + maxNameLength + " characters")
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> > >> >> >> > >> > >> >       case Some(t) =>
> > >> >> >> > >> > >> >         if (!t.equals(topic))
> > >> >> >> > >> > >> >           throw new InvalidTopicException("topic
> name
> > " +
> > >> >> topic
> > >> >> >> > + "
> > >> >> >> > >> is
> > >> >> >> > >> > >> > illegal, contains a character other than ASCII
> > >> >> alphanumerics,
> > >> >> >> > '.',
> > >> >> >> > >> '_'
> > >> >> >> > >> > >> and
> > >> >> >> > >> > >> > '-'")
> > >> >> >> > >> > >> >       case None => throw new
> > InvalidTopicException("topic
> > >> >> name
> > >> >> >> "
> > >> >> >> > +
> > >> >> >> > >> > topic
> > >> >> >> > >> > >> +
> > >> >> >> > >> > >> > " is illegal,  contains a character other than ASCII
> > >> >> >> > alphanumerics,
> > >> >> >> > >> > '.',
> > >> >> >> > >> > >> > '_' and '-'")
> > >> >> >> > >> > >> >     }
> > >> >> >> > >> > >> >   }
> > >> >> >> > >> > >> > }
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> > >> >> >> tpalino@gmail.com>
> > >> >> >> > >> > wrote:
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > > I had to go look this one up again to make sure -
> > >> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > > The only valid character names for topics are
> > >> >> alphanumeric,
> > >> >> >> > >> > underscore,
> > >> >> >> > >> > >> > and
> > >> >> >> > >> > >> > > dash. A period is not supposed to be a valid
> > character
> > >> to
> > >> >> >> use.
> > >> >> >> > If
> > >> >> >> > >> > >> you're
> > >> >> >> > >> > >> > > seeing them, then one of two things have happened:
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > > 1) You have topic names that are grandfathered in
> > from
> > >> >> before
> > >> >> >> > that
> > >> >> >> > >> > >> patch
> > >> >> >> > >> > >> > > 2) The patch is not working properly and there is
> > >> >> somewhere
> > >> >> >> in
> > >> >> >> > the
> > >> >> >> > >> > >> broker
> > >> >> >> > >> > >> > > that the standard is not being enforced.
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > > -Todd
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > >> >> >> > brock@apache.org>
> > >> >> >> > >> > >> wrote:
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> >> >> > >> > >> gshapira@cloudera.com>
> > >> >> >> > >> > >> > > > wrote:
> > >> >> >> > >> > >> > > > > Hi Kafka Fans,
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and
> > the
> > >> >> other
> > >> >> >> > named
> > >> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
> > >> named
> > >> >> >> > >> kafka_lab_2
> > >> >> >> > >> > >> for
> > >> >> >> > >> > >> > > > > both, effectively making it impossible to
> > monitor
> > >> them
> > >> >> >> > >> properly.
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > The reason this happens is that using "." in
> > topic
> > >> >> names
> > >> >> >> is
> > >> >> >> > >> > pretty
> > >> >> >> > >> > >> > > > > common, especially as a way to group topics
> into
> > >> data
> > >> >> >> > centers,
> > >> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around
> to
> > our
> > >> >> >> current
> > >> >> >> > >> > lack of
> > >> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> > >> systems
> > >> >> >> using
> > >> >> >> > "."
> > >> >> >> > >> > to
> > >> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
> > >> metric
> > >> >> >> names,
> > >> >> >> > >> > Kafka
> > >> >> >> > >> > >> > > > > replaces the "." in the name with an
> underscore.
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > This generates good metric names, but creates
> > the
> > >> >> problem
> > >> >> >> > with
> > >> >> >> > >> > name
> > >> >> >> > >> > >> > > > collisions.
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply
> limit
> > the
> > >> >> range
> > >> >> >> > of
> > >> >> >> > >> > >> > > > > characters permitted in a topic name and
> > disallow
> > >> "_"?
> > >> >> >> > >> Obviously
> > >> >> >> > >> > >> > > > > existing topics will need to remain as is,
> which
> > >> is a
> > >> >> bit
> > >> >> >> > >> > awkward.
> > >> >> >> > >> > >> > > >
> > >> >> >> > >> > >> > > > Interesting problem! Many if not most users I
> > >> >> personally am
> > >> >> >> > >> aware
> > >> >> >> > >> > of
> > >> >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure
> > that
> > >> >> many
> > >> >> >> > users
> > >> >> >> > >> > >> would
> > >> >> >> > >> > >> > > > be quite surprised by this limitation. With that
> > >> said,
> > >> >> I am
> > >> >> >> > sure
> > >> >> >> > >> > >> > > > they'd transition accordingly.
> > >> >> >> > >> > >> > > >
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > If anyone has better backward-compatible
> > solutions
> > >> to
> > >> >> >> this,
> > >> >> >> > >> I'm
> > >> >> >> > >> > all
> > >> >> >> > >> > >> > > ears
> > >> >> >> > >> > >> > > > :)
> > >> >> >> > >> > >> > > > >
> > >> >> >> > >> > >> > > > > Gwen
> > >> >> >> > >> > >> > > >
> > >> >> >> > >> > >> > >
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >> > --
> > >> >> >> > >> > >> > Grant Henke
> > >> >> >> > >> > >> > Solutions Consultant | Cloudera
> > >> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > >> >> >> > >> > linkedin.com/in/granthenke
> > >> >> >> > >> > >> >
> > >> >> >> > >> > >>
> > >> >> >> > >> > >
> > >> >> >> > >> > >
> > >> >> >> > >> > >
> > >> >> >> > >> > > --
> > >> >> >> > >> > > Grant Henke
> > >> >> >> > >> > > Solutions Consultant | Cloudera
> > >> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > >> >> >> > linkedin.com/in/granthenke
> > >> >> >> > >> >
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> --
> > >> >> >> Thanks,
> > >> >> >> Neha
> > >> >> >>
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Thanks,
> > >> > Ewen
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> >
>

Re: [Discussion] Limitations on topic names

Posted by Jun Rao <ju...@confluent.io>.
First, a couple of clarifications on this.

1. Currently, we allow Kafka topic to have dots, except that we disallow
topic names that are exactly "." or ".." (which can cause weird problems
when mapping to file directories and ZK paths as Gwen pointed out).

2. When creating the Coda Hale metrics, currently, we only replace dot with
_ in the scope of the metric name. The actually jmx bean name still
preserves dot. This is because the Graphite reporter uses scope when
forming the metric names and assumes dots are component separators (see
KAFKA-1902 for details). So, if one uses tools like jmxtrans to export the
metrics from the mbeans directly, the original topic name is preserved.
However, I am not sure how well this maps to Graphite. We thought about
making the replacing character configurable. However, the difficulty is
that the logic of doing the replacement is in a singleton
class KafkaMetricsGroup and I am not sure if we can pass in an external
config.

Given the above, I'd suggest that customer try the jmxtrans to Graphite
path and see if that helps. I agree that it's too disruptive to restrict
the current topic naming convention.

Also, since we plan to replace Coda Hale metrics with Kafka metrics in the
future, we can try to address this issue better then.

Thanks,

Jun




On Sun, Jul 12, 2015 at 10:26 AM, Gwen Shapira <gs...@cloudera.com>
wrote:

> I like the "lets warn people of conflicts when creating the topic"
> suggestion. IMO, automatic topic creation as currently done is buggy
> either way (Send data and hope the topic is ready before retries run
> out, potentially failing with the super helpful NO_LEADER error), so I
> don't mind leaving it broken a bit more. I think the right behavior is
> that conflicts will cause auto creating to fail, the same way we
> currently do when the default number of replicas is higher than number
> of brokers.
>
> One thing that is left confusing is that people in the "." camp need
> to know about the conversion or they will fail to find their topics in
> their monitoring tools. Not very nice to them, but I can't think of
> alternatives.
>
> I'll start with the doc patch :)
>
> On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
> <ew...@confluent.io> wrote:
> > On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
> >
> >> Yeah, I have an actual customer who ran into this. Unfortunately,
> >> inconsistencies in the way things are named are pretty common - just
> >> look at Kafka's many CLI options.
> >>
> >> I don't think that supporting both and pointing at the docs with "I
> >> told you so" when our metrics break is a good solution.
> >>
> >
> > I agree, especially since we don't *already* have something in the docs
> > indicating this will be an issue. I was flippant about the situation
> > because I *wish* there was more careful consideration + naming policy in
> > place, but I realize that doesn't always happen in practice. I guess I
> need
> > to take Compatibility Czar more seriously :)
> >
> > I see think the obvious practical options are as follows:
> >
> > 1. Kill support for "_". Piss off the entire set of people who currently
> > use "_" anywhere in topic names.
> > 2. Kill support for ".". Piss off the entire set of people who currently
> > use "." anywhere in topic names.
> > 3. Tell people they need to be careful about this issue. Piss off the set
> > of people who use both "_" and "." *and* happen to have conflicting topic
> > names. They will have some pain when they discover the issue and have to
> > figure out how to move one of those topics over to a non-conflicting
> name.
> > I'm going to claim that this group must be an *extremely* small fraction
> of
> > users, which doesn't make it better to allow things to break for them,
> but
> > at least gives us an idea of the scale of impact.
> >
> > (One other alternative suggested earlier was encoding metric names to
> > account for differences; given the metric renaming mess in the last
> > release, I'm extremely hesitant to suggest anything of the sort...)
> >
> > None of the options are ideal, but to me, 3 seems like the least painful.
> > Both for us, and for the vast majority of users. It seems to me that the
> > number of users that would complain about (1) or (2) drastically outweigh
> > (3).
> >
> > At this point, I don't think it's practical to keep switching the rules
> > about which characters are allowed and which aren't because the previous
> > attempts haven't been successful -- it seems the rules have changed
> > multiple times, whether intentionally or accidentally, such that any more
> > changes will cause problems. At this point, I think we just need to
> accept
> > being liberal in accepting the range of topic names that have been
> > permitted so far and make the best of the situation, even if it means
> only
> > being able to warn people of conflicts.
> >
> > Here's another alternative: how about being liberal with topic name
> > characters, but upon topic creation we convert the name to the metric
> name
> > and fail if there's a conflict with another topic? This is relatively
> > expensive (requires getting the metric name of all other topics), but it
> > avoids the bad situation we're encountering here (conflicting metrics),
> > avoids getting into a persistent conflict (we kill topic creation when we
> > detect the issue rather than noticing it when the metrics conflict
> > happens), and keeps the vast majority of existing users happy (both _
> and .
> > work in topic names as long as you don't create topics with conflicting
> > metric names).
> >
> > There are definitely details to be worked out (auto topic creation?), but
> > it seems like a more realistic solution than to start disallowing _ or .
> in
> > topic names.
> >
> > -Ewen
> >
> >
> >>
> >> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> >> <ew...@confluent.io> wrote:
> >> > I figure you'll probably see complaints no matter what change you
> make.
> >> > Gwen, given that you raised this, another important question might be
> how
> >> > many people you see using *both*. I'm guessing this question came up
> >> > because you actually saw a conflict? But I'd imagine (or at least
> hope)
> >> > that most organizations are mostly consistent about naming topics --
> they
> >> > standardize on one or the other.
> >> >
> >> > Since there's no "right" way to name them, I'd just leave it
> supporting
> >> > both and document the potential conflict in metrics. And if people use
> >> both
> >> > naming schemes, they probably deserve to suffer for their
> inconsistency
> >> :)
> >> >
> >> > -Ewen
> >> >
> >> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
> >> wrote:
> >> >
> >> >> I find dots more common in my customer base, so I will definitely
> feel
> >> >> the pain of removing them.
> >> >>
> >> >> However, "." are already used in metrics, file names, directories,
> etc
> >> >> - so if we keep the dots, we need to keep code that translates them
> >> >> and document the translation. Just banning "." seems more natural.
> >> >> Also, as Grant mentioned, we'll probably have our own special usage
> >> >> for "." down the line.
> >> >>
> >> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> wrote:
> >> >> > I absolutely disagree with #2, Neha. That will break a lot of
> >> >> > infrastructure within LinkedIn. That said, removing "." might break
> >> other
> >> >> > people as well, but I think we should have a clearer idea of how
> much
> >> >> usage
> >> >> > there is on either side.
> >> >> >
> >> >> > -Todd
> >> >> >
> >> >> >
> >> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
> >> >> wrote:
> >> >> >
> >> >> >> "." seems natural for grouping topic names. +1 for 2) going
> forward
> >> only
> >> >> >> without breaking previously created topics with "_" though that
> might
> >> >> >> require us to patch the code somewhat awkwardly till we phase it
> out
> >> a
> >> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> >> versions
> >> >> >> later.
> >> >> >>
> >> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> gshapira@cloudera.com
> >> >
> >> >> >> wrote:
> >> >> >>
> >> >> >> > I don't think we should break existing topics. Just disallow new
> >> >> >> > topics going forward.
> >> >> >> >
> >> >> >> > Agree that having both is horrible, but we should have a
> solution
> >> that
> >> >> >> > fails when you run "kafka_topics.sh --create", not when you
> >> configure
> >> >> >> > Ganglia.
> >> >> >> >
> >> >> >> > Gwen
> >> >> >> >
> >> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> >> wrote:
> >> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> >> perverse,
> >> >> >> but
> >> >> >> > > people seem to do it. Breaking all the topics with '.' in the
> >> name
> >> >> >> seems
> >> >> >> > > like it could be worse than combining metrics for people who
> >> have a
> >> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> >> >> perverse,
> >> >> >> > > no?).
> >> >> >> > >
> >> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> >> >> >> > >
> >> >> >> > > -Jay
> >> >> >> > >
> >> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <
> tpalino@gmail.com>
> >> >> >> wrote:
> >> >> >> > >
> >> >> >> > >> My selfish point of view is that we do #1, as we use "_"
> >> >> extensively
> >> >> >> in
> >> >> >> > >> topic names here :) I also happen to think it's the right
> >> choice,
> >> >> >> > >> specifically because "." has more special meanings, as you
> >> noted.
> >> >> >> > >>
> >> >> >> > >> -Todd
> >> >> >> > >>
> >> >> >> > >>
> >> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> >> >> gshapira@cloudera.com>
> >> >> >> > >> wrote:
> >> >> >> > >>
> >> >> >> > >> > Unintentional side effect from allowing IP addresses in
> >> consumer
> >> >> >> > client
> >> >> >> > >> > IDs :)
> >> >> >> > >> >
> >> >> >> > >> > So the question is, what do we do now?
> >> >> >> > >> >
> >> >> >> > >> > 1) disallow "."
> >> >> >> > >> > 2) disallow "_"
> >> >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
> >> break
> >> >> >> > existing
> >> >> >> > >> > metrics
> >> >> >> > >> > 4) all of the above?
> >> >> >> > >> >
> >> >> >> > >> > btw. it looks like "." and ".." are currently valid. Topic
> >> names
> >> >> are
> >> >> >> > >> > used for directories, right? this sounds like fun :)
> >> >> >> > >> >
> >> >> >> > >> > I vote for option #1, although if someone has a good idea
> for
> >> #3
> >> >> it
> >> >> >> > >> > will be even better.
> >> >> >> > >> >
> >> >> >> > >> > Gwen
> >> >> >> > >> >
> >> >> >> > >> >
> >> >> >> > >> >
> >> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> >> >> ghenke@cloudera.com>
> >> >> >> > >> wrote:
> >> >> >> > >> > > Found it was added here:
> >> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> >> >> >> > >> > >
> >> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> >> >> tpalino@gmail.com>
> >> >> >> > >> wrote:
> >> >> >> > >> > >
> >> >> >> > >> > >> This was definitely changed at some point after
> KAFKA-495.
> >> The
> >> >> >> > >> question
> >> >> >> > >> > is
> >> >> >> > >> > >> when and why.
> >> >> >> > >> > >>
> >> >> >> > >> > >> Here's the relevant code from that patch:
> >> >> >> > >> > >>
> >> >> >> > >> > >>
> >> >> >>
> ===================================================================
> >> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala
> (revision
> >> >> >> 1390178)
> >> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working
> >> copy)
> >> >> >> > >> > >> @@ -21,24 +21,21 @@
> >> >> >> > >> > >>  import util.matching.Regex
> >> >> >> > >> > >>
> >> >> >> > >> > >>  object Topic {
> >> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> >> >> >> > >> > >>
> >> >> >> > >> > >>
> >> >> >> > >> > >>
> >> >> >> > >> > >> -Todd
> >> >> >> > >> > >>
> >> >> >> > >> > >>
> >> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> >> >> >> ghenke@cloudera.com>
> >> >> >> > >> > wrote:
> >> >> >> > >> > >>
> >> >> >> > >> > >> > kafka.common.Topic shows that currently period is a
> valid
> >> >> >> > character
> >> >> >> > >> > and I
> >> >> >> > >> > >> > have verified I can use kafka-topics.sh to create a
> new
> >> >> topic
> >> >> >> > with a
> >> >> >> > >> > >> > period.
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >
> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> >> >> >> > currently
> >> >> >> > >> > uses
> >> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > Should period character support be removed? I was
> under
> >> the
> >> >> >> same
> >> >> >> > >> > >> impression
> >> >> >> > >> > >> > as Gwen, that a period was used by many as a way to
> >> "group"
> >> >> >> > topics.
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > The code is pasted below since its small:
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > object Topic {
> >> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> >> >> > >> > >> >   private val maxNameLength = 255
> >> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >   val InternalTopics =
> >> Set(OffsetManager.OffsetsTopicName)
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >   def validate(topic: String) {
> >> >> >> > >> > >> >     if (topic.length <= 0)
> >> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> >> >> illegal,
> >> >> >> > can't
> >> >> >> > >> be
> >> >> >> > >> > >> > empty")
> >> >> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> >> >> >> > >> > >> >       throw new InvalidTopicException("topic name
> cannot
> >> be
> >> >> >> > \".\" or
> >> >> >> > >> > >> > \"..\"")
> >> >> >> > >> > >> >     else if (topic.length > maxNameLength)
> >> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> >> >> illegal,
> >> >> >> > can't
> >> >> >> > >> be
> >> >> >> > >> > >> > longer than " + maxNameLength + " characters")
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> >> >> >> > >> > >> >       case Some(t) =>
> >> >> >> > >> > >> >         if (!t.equals(topic))
> >> >> >> > >> > >> >           throw new InvalidTopicException("topic name
> " +
> >> >> topic
> >> >> >> > + "
> >> >> >> > >> is
> >> >> >> > >> > >> > illegal, contains a character other than ASCII
> >> >> alphanumerics,
> >> >> >> > '.',
> >> >> >> > >> '_'
> >> >> >> > >> > >> and
> >> >> >> > >> > >> > '-'")
> >> >> >> > >> > >> >       case None => throw new
> InvalidTopicException("topic
> >> >> name
> >> >> >> "
> >> >> >> > +
> >> >> >> > >> > topic
> >> >> >> > >> > >> +
> >> >> >> > >> > >> > " is illegal,  contains a character other than ASCII
> >> >> >> > alphanumerics,
> >> >> >> > >> > '.',
> >> >> >> > >> > >> > '_' and '-'")
> >> >> >> > >> > >> >     }
> >> >> >> > >> > >> >   }
> >> >> >> > >> > >> > }
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> >> >> >> tpalino@gmail.com>
> >> >> >> > >> > wrote:
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > > I had to go look this one up again to make sure -
> >> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > > The only valid character names for topics are
> >> >> alphanumeric,
> >> >> >> > >> > underscore,
> >> >> >> > >> > >> > and
> >> >> >> > >> > >> > > dash. A period is not supposed to be a valid
> character
> >> to
> >> >> >> use.
> >> >> >> > If
> >> >> >> > >> > >> you're
> >> >> >> > >> > >> > > seeing them, then one of two things have happened:
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > > 1) You have topic names that are grandfathered in
> from
> >> >> before
> >> >> >> > that
> >> >> >> > >> > >> patch
> >> >> >> > >> > >> > > 2) The patch is not working properly and there is
> >> >> somewhere
> >> >> >> in
> >> >> >> > the
> >> >> >> > >> > >> broker
> >> >> >> > >> > >> > > that the standard is not being enforced.
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > > -Todd
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> >> >> >> > brock@apache.org>
> >> >> >> > >> > >> wrote:
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> >> >> > >> > >> gshapira@cloudera.com>
> >> >> >> > >> > >> > > > wrote:
> >> >> >> > >> > >> > > > > Hi Kafka Fans,
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and
> the
> >> >> other
> >> >> >> > named
> >> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
> >> named
> >> >> >> > >> kafka_lab_2
> >> >> >> > >> > >> for
> >> >> >> > >> > >> > > > > both, effectively making it impossible to
> monitor
> >> them
> >> >> >> > >> properly.
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > The reason this happens is that using "." in
> topic
> >> >> names
> >> >> >> is
> >> >> >> > >> > pretty
> >> >> >> > >> > >> > > > > common, especially as a way to group topics into
> >> data
> >> >> >> > centers,
> >> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around to
> our
> >> >> >> current
> >> >> >> > >> > lack of
> >> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> >> systems
> >> >> >> using
> >> >> >> > "."
> >> >> >> > >> > to
> >> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
> >> metric
> >> >> >> names,
> >> >> >> > >> > Kafka
> >> >> >> > >> > >> > > > > replaces the "." in the name with an underscore.
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > This generates good metric names, but creates
> the
> >> >> problem
> >> >> >> > with
> >> >> >> > >> > name
> >> >> >> > >> > >> > > > collisions.
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit
> the
> >> >> range
> >> >> >> > of
> >> >> >> > >> > >> > > > > characters permitted in a topic name and
> disallow
> >> "_"?
> >> >> >> > >> Obviously
> >> >> >> > >> > >> > > > > existing topics will need to remain as is, which
> >> is a
> >> >> bit
> >> >> >> > >> > awkward.
> >> >> >> > >> > >> > > >
> >> >> >> > >> > >> > > > Interesting problem! Many if not most users I
> >> >> personally am
> >> >> >> > >> aware
> >> >> >> > >> > of
> >> >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure
> that
> >> >> many
> >> >> >> > users
> >> >> >> > >> > >> would
> >> >> >> > >> > >> > > > be quite surprised by this limitation. With that
> >> said,
> >> >> I am
> >> >> >> > sure
> >> >> >> > >> > >> > > > they'd transition accordingly.
> >> >> >> > >> > >> > > >
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > If anyone has better backward-compatible
> solutions
> >> to
> >> >> >> this,
> >> >> >> > >> I'm
> >> >> >> > >> > all
> >> >> >> > >> > >> > > ears
> >> >> >> > >> > >> > > > :)
> >> >> >> > >> > >> > > > >
> >> >> >> > >> > >> > > > > Gwen
> >> >> >> > >> > >> > > >
> >> >> >> > >> > >> > >
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >
> >> >> >> > >> > >> >
> >> >> >> > >> > >> > --
> >> >> >> > >> > >> > Grant Henke
> >> >> >> > >> > >> > Solutions Consultant | Cloudera
> >> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> >> >> >> > >> > linkedin.com/in/granthenke
> >> >> >> > >> > >> >
> >> >> >> > >> > >>
> >> >> >> > >> > >
> >> >> >> > >> > >
> >> >> >> > >> > >
> >> >> >> > >> > > --
> >> >> >> > >> > > Grant Henke
> >> >> >> > >> > > Solutions Consultant | Cloudera
> >> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> >> >> >> > linkedin.com/in/granthenke
> >> >> >> > >> >
> >> >> >> > >>
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Thanks,
> >> >> >> Neha
> >> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks,
> >> > Ewen
> >>
> >
> >
> >
> > --
> > Thanks,
> > Ewen
>

Re: [Discussion] Limitations on topic names

Posted by Gwen Shapira <gs...@cloudera.com>.
I like the "lets warn people of conflicts when creating the topic"
suggestion. IMO, automatic topic creation as currently done is buggy
either way (Send data and hope the topic is ready before retries run
out, potentially failing with the super helpful NO_LEADER error), so I
don't mind leaving it broken a bit more. I think the right behavior is
that conflicts will cause auto creating to fail, the same way we
currently do when the default number of replicas is higher than number
of brokers.

One thing that is left confusing is that people in the "." camp need
to know about the conversion or they will fail to find their topics in
their monitoring tools. Not very nice to them, but I can't think of
alternatives.

I'll start with the doc patch :)

On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava
<ew...@confluent.io> wrote:
> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com> wrote:
>
>> Yeah, I have an actual customer who ran into this. Unfortunately,
>> inconsistencies in the way things are named are pretty common - just
>> look at Kafka's many CLI options.
>>
>> I don't think that supporting both and pointing at the docs with "I
>> told you so" when our metrics break is a good solution.
>>
>
> I agree, especially since we don't *already* have something in the docs
> indicating this will be an issue. I was flippant about the situation
> because I *wish* there was more careful consideration + naming policy in
> place, but I realize that doesn't always happen in practice. I guess I need
> to take Compatibility Czar more seriously :)
>
> I see think the obvious practical options are as follows:
>
> 1. Kill support for "_". Piss off the entire set of people who currently
> use "_" anywhere in topic names.
> 2. Kill support for ".". Piss off the entire set of people who currently
> use "." anywhere in topic names.
> 3. Tell people they need to be careful about this issue. Piss off the set
> of people who use both "_" and "." *and* happen to have conflicting topic
> names. They will have some pain when they discover the issue and have to
> figure out how to move one of those topics over to a non-conflicting name.
> I'm going to claim that this group must be an *extremely* small fraction of
> users, which doesn't make it better to allow things to break for them, but
> at least gives us an idea of the scale of impact.
>
> (One other alternative suggested earlier was encoding metric names to
> account for differences; given the metric renaming mess in the last
> release, I'm extremely hesitant to suggest anything of the sort...)
>
> None of the options are ideal, but to me, 3 seems like the least painful.
> Both for us, and for the vast majority of users. It seems to me that the
> number of users that would complain about (1) or (2) drastically outweigh
> (3).
>
> At this point, I don't think it's practical to keep switching the rules
> about which characters are allowed and which aren't because the previous
> attempts haven't been successful -- it seems the rules have changed
> multiple times, whether intentionally or accidentally, such that any more
> changes will cause problems. At this point, I think we just need to accept
> being liberal in accepting the range of topic names that have been
> permitted so far and make the best of the situation, even if it means only
> being able to warn people of conflicts.
>
> Here's another alternative: how about being liberal with topic name
> characters, but upon topic creation we convert the name to the metric name
> and fail if there's a conflict with another topic? This is relatively
> expensive (requires getting the metric name of all other topics), but it
> avoids the bad situation we're encountering here (conflicting metrics),
> avoids getting into a persistent conflict (we kill topic creation when we
> detect the issue rather than noticing it when the metrics conflict
> happens), and keeps the vast majority of existing users happy (both _ and .
> work in topic names as long as you don't create topics with conflicting
> metric names).
>
> There are definitely details to be worked out (auto topic creation?), but
> it seems like a more realistic solution than to start disallowing _ or . in
> topic names.
>
> -Ewen
>
>
>>
>> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
>> <ew...@confluent.io> wrote:
>> > I figure you'll probably see complaints no matter what change you make.
>> > Gwen, given that you raised this, another important question might be how
>> > many people you see using *both*. I'm guessing this question came up
>> > because you actually saw a conflict? But I'd imagine (or at least hope)
>> > that most organizations are mostly consistent about naming topics -- they
>> > standardize on one or the other.
>> >
>> > Since there's no "right" way to name them, I'd just leave it supporting
>> > both and document the potential conflict in metrics. And if people use
>> both
>> > naming schemes, they probably deserve to suffer for their inconsistency
>> :)
>> >
>> > -Ewen
>> >
>> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
>> wrote:
>> >
>> >> I find dots more common in my customer base, so I will definitely feel
>> >> the pain of removing them.
>> >>
>> >> However, "." are already used in metrics, file names, directories, etc
>> >> - so if we keep the dots, we need to keep code that translates them
>> >> and document the translation. Just banning "." seems more natural.
>> >> Also, as Grant mentioned, we'll probably have our own special usage
>> >> for "." down the line.
>> >>
>> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
>> >> > I absolutely disagree with #2, Neha. That will break a lot of
>> >> > infrastructure within LinkedIn. That said, removing "." might break
>> other
>> >> > people as well, but I think we should have a clearer idea of how much
>> >> usage
>> >> > there is on either side.
>> >> >
>> >> > -Todd
>> >> >
>> >> >
>> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
>> >> wrote:
>> >> >
>> >> >> "." seems natural for grouping topic names. +1 for 2) going forward
>> only
>> >> >> without breaking previously created topics with "_" though that might
>> >> >> require us to patch the code somewhat awkwardly till we phase it out
>> a
>> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
>> versions
>> >> >> later.
>> >> >>
>> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gshapira@cloudera.com
>> >
>> >> >> wrote:
>> >> >>
>> >> >> > I don't think we should break existing topics. Just disallow new
>> >> >> > topics going forward.
>> >> >> >
>> >> >> > Agree that having both is horrible, but we should have a solution
>> that
>> >> >> > fails when you run "kafka_topics.sh --create", not when you
>> configure
>> >> >> > Ganglia.
>> >> >> >
>> >> >> > Gwen
>> >> >> >
>> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
>> wrote:
>> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
>> perverse,
>> >> >> but
>> >> >> > > people seem to do it. Breaking all the topics with '.' in the
>> name
>> >> >> seems
>> >> >> > > like it could be worse than combining metrics for people who
>> have a
>> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
>> >> perverse,
>> >> >> > > no?).
>> >> >> > >
>> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
>> >> >> > >
>> >> >> > > -Jay
>> >> >> > >
>> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>> >> >> wrote:
>> >> >> > >
>> >> >> > >> My selfish point of view is that we do #1, as we use "_"
>> >> extensively
>> >> >> in
>> >> >> > >> topic names here :) I also happen to think it's the right
>> choice,
>> >> >> > >> specifically because "." has more special meanings, as you
>> noted.
>> >> >> > >>
>> >> >> > >> -Todd
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
>> >> gshapira@cloudera.com>
>> >> >> > >> wrote:
>> >> >> > >>
>> >> >> > >> > Unintentional side effect from allowing IP addresses in
>> consumer
>> >> >> > client
>> >> >> > >> > IDs :)
>> >> >> > >> >
>> >> >> > >> > So the question is, what do we do now?
>> >> >> > >> >
>> >> >> > >> > 1) disallow "."
>> >> >> > >> > 2) disallow "_"
>> >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
>> break
>> >> >> > existing
>> >> >> > >> > metrics
>> >> >> > >> > 4) all of the above?
>> >> >> > >> >
>> >> >> > >> > btw. it looks like "." and ".." are currently valid. Topic
>> names
>> >> are
>> >> >> > >> > used for directories, right? this sounds like fun :)
>> >> >> > >> >
>> >> >> > >> > I vote for option #1, although if someone has a good idea for
>> #3
>> >> it
>> >> >> > >> > will be even better.
>> >> >> > >> >
>> >> >> > >> > Gwen
>> >> >> > >> >
>> >> >> > >> >
>> >> >> > >> >
>> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
>> >> ghenke@cloudera.com>
>> >> >> > >> wrote:
>> >> >> > >> > > Found it was added here:
>> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
>> >> >> > >> > >
>> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
>> >> tpalino@gmail.com>
>> >> >> > >> wrote:
>> >> >> > >> > >
>> >> >> > >> > >> This was definitely changed at some point after KAFKA-495.
>> The
>> >> >> > >> question
>> >> >> > >> > is
>> >> >> > >> > >> when and why.
>> >> >> > >> > >>
>> >> >> > >> > >> Here's the relevant code from that patch:
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> ===================================================================
>> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>> >> >> 1390178)
>> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working
>> copy)
>> >> >> > >> > >> @@ -21,24 +21,21 @@
>> >> >> > >> > >>  import util.matching.Regex
>> >> >> > >> > >>
>> >> >> > >> > >>  object Topic {
>> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >> -Todd
>> >> >> > >> > >>
>> >> >> > >> > >>
>> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>> >> >> ghenke@cloudera.com>
>> >> >> > >> > wrote:
>> >> >> > >> > >>
>> >> >> > >> > >> > kafka.common.Topic shows that currently period is a valid
>> >> >> > character
>> >> >> > >> > and I
>> >> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
>> >> topic
>> >> >> > with a
>> >> >> > >> > >> > period.
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>> >> >> > currently
>> >> >> > >> > uses
>> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
>> >> >> > >> > >> >
>> >> >> > >> > >> > Should period character support be removed? I was under
>> the
>> >> >> same
>> >> >> > >> > >> impression
>> >> >> > >> > >> > as Gwen, that a period was used by many as a way to
>> "group"
>> >> >> > topics.
>> >> >> > >> > >> >
>> >> >> > >> > >> > The code is pasted below since its small:
>> >> >> > >> > >> >
>> >> >> > >> > >> > object Topic {
>> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> >> >> > >> > >> >   private val maxNameLength = 255
>> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
>> >> >> > >> > >> >
>> >> >> > >> > >> >   val InternalTopics =
>> Set(OffsetManager.OffsetsTopicName)
>> >> >> > >> > >> >
>> >> >> > >> > >> >   def validate(topic: String) {
>> >> >> > >> > >> >     if (topic.length <= 0)
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> >> illegal,
>> >> >> > can't
>> >> >> > >> be
>> >> >> > >> > >> > empty")
>> >> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name cannot
>> be
>> >> >> > \".\" or
>> >> >> > >> > >> > \"..\"")
>> >> >> > >> > >> >     else if (topic.length > maxNameLength)
>> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> >> illegal,
>> >> >> > can't
>> >> >> > >> be
>> >> >> > >> > >> > longer than " + maxNameLength + " characters")
>> >> >> > >> > >> >
>> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
>> >> >> > >> > >> >       case Some(t) =>
>> >> >> > >> > >> >         if (!t.equals(topic))
>> >> >> > >> > >> >           throw new InvalidTopicException("topic name " +
>> >> topic
>> >> >> > + "
>> >> >> > >> is
>> >> >> > >> > >> > illegal, contains a character other than ASCII
>> >> alphanumerics,
>> >> >> > '.',
>> >> >> > >> '_'
>> >> >> > >> > >> and
>> >> >> > >> > >> > '-'")
>> >> >> > >> > >> >       case None => throw new InvalidTopicException("topic
>> >> name
>> >> >> "
>> >> >> > +
>> >> >> > >> > topic
>> >> >> > >> > >> +
>> >> >> > >> > >> > " is illegal,  contains a character other than ASCII
>> >> >> > alphanumerics,
>> >> >> > >> > '.',
>> >> >> > >> > >> > '_' and '-'")
>> >> >> > >> > >> >     }
>> >> >> > >> > >> >   }
>> >> >> > >> > >> > }
>> >> >> > >> > >> >
>> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>> >> >> tpalino@gmail.com>
>> >> >> > >> > wrote:
>> >> >> > >> > >> >
>> >> >> > >> > >> > > I had to go look this one up again to make sure -
>> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > The only valid character names for topics are
>> >> alphanumeric,
>> >> >> > >> > underscore,
>> >> >> > >> > >> > and
>> >> >> > >> > >> > > dash. A period is not supposed to be a valid character
>> to
>> >> >> use.
>> >> >> > If
>> >> >> > >> > >> you're
>> >> >> > >> > >> > > seeing them, then one of two things have happened:
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > 1) You have topic names that are grandfathered in from
>> >> before
>> >> >> > that
>> >> >> > >> > >> patch
>> >> >> > >> > >> > > 2) The patch is not working properly and there is
>> >> somewhere
>> >> >> in
>> >> >> > the
>> >> >> > >> > >> broker
>> >> >> > >> > >> > > that the standard is not being enforced.
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > -Todd
>> >> >> > >> > >> > >
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>> >> >> > brock@apache.org>
>> >> >> > >> > >> wrote:
>> >> >> > >> > >> > >
>> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> >> >> > >> > >> gshapira@cloudera.com>
>> >> >> > >> > >> > > > wrote:
>> >> >> > >> > >> > > > > Hi Kafka Fans,
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
>> >> other
>> >> >> > named
>> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
>> named
>> >> >> > >> kafka_lab_2
>> >> >> > >> > >> for
>> >> >> > >> > >> > > > > both, effectively making it impossible to monitor
>> them
>> >> >> > >> properly.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > The reason this happens is that using "." in topic
>> >> names
>> >> >> is
>> >> >> > >> > pretty
>> >> >> > >> > >> > > > > common, especially as a way to group topics into
>> data
>> >> >> > centers,
>> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around to our
>> >> >> current
>> >> >> > >> > lack of
>> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
>> systems
>> >> >> using
>> >> >> > "."
>> >> >> > >> > to
>> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
>> metric
>> >> >> names,
>> >> >> > >> > Kafka
>> >> >> > >> > >> > > > > replaces the "." in the name with an underscore.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > This generates good metric names, but creates the
>> >> problem
>> >> >> > with
>> >> >> > >> > name
>> >> >> > >> > >> > > > collisions.
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit the
>> >> range
>> >> >> > of
>> >> >> > >> > >> > > > > characters permitted in a topic name and disallow
>> "_"?
>> >> >> > >> Obviously
>> >> >> > >> > >> > > > > existing topics will need to remain as is, which
>> is a
>> >> bit
>> >> >> > >> > awkward.
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > > > Interesting problem! Many if not most users I
>> >> personally am
>> >> >> > >> aware
>> >> >> > >> > of
>> >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure that
>> >> many
>> >> >> > users
>> >> >> > >> > >> would
>> >> >> > >> > >> > > > be quite surprised by this limitation. With that
>> said,
>> >> I am
>> >> >> > sure
>> >> >> > >> > >> > > > they'd transition accordingly.
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > If anyone has better backward-compatible solutions
>> to
>> >> >> this,
>> >> >> > >> I'm
>> >> >> > >> > all
>> >> >> > >> > >> > > ears
>> >> >> > >> > >> > > > :)
>> >> >> > >> > >> > > > >
>> >> >> > >> > >> > > > > Gwen
>> >> >> > >> > >> > > >
>> >> >> > >> > >> > >
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> >
>> >> >> > >> > >> > --
>> >> >> > >> > >> > Grant Henke
>> >> >> > >> > >> > Solutions Consultant | Cloudera
>> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
>> >> >> > >> > linkedin.com/in/granthenke
>> >> >> > >> > >> >
>> >> >> > >> > >>
>> >> >> > >> > >
>> >> >> > >> > >
>> >> >> > >> > >
>> >> >> > >> > > --
>> >> >> > >> > > Grant Henke
>> >> >> > >> > > Solutions Consultant | Cloudera
>> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
>> >> >> > linkedin.com/in/granthenke
>> >> >> > >> >
>> >> >> > >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks,
>> >> >> Neha
>> >> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Ewen
>>
>
>
>
> --
> Thanks,
> Ewen

Re: [Discussion] Limitations on topic names

Posted by Guozhang Wang <wa...@gmail.com>.
For resolving the metrics conflicts, we can alternatively let Kafka to
replace "." with double underscores "__" if that is the primary reason for
topic name restrictions.

Guozhang

On Sat, Jul 11, 2015 at 12:54 AM, Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > Yeah, I have an actual customer who ran into this. Unfortunately,
> > inconsistencies in the way things are named are pretty common - just
> > look at Kafka's many CLI options.
> >
> > I don't think that supporting both and pointing at the docs with "I
> > told you so" when our metrics break is a good solution.
> >
>
> I agree, especially since we don't *already* have something in the docs
> indicating this will be an issue. I was flippant about the situation
> because I *wish* there was more careful consideration + naming policy in
> place, but I realize that doesn't always happen in practice. I guess I need
> to take Compatibility Czar more seriously :)
>
> I see think the obvious practical options are as follows:
>
> 1. Kill support for "_". Piss off the entire set of people who currently
> use "_" anywhere in topic names.
> 2. Kill support for ".". Piss off the entire set of people who currently
> use "." anywhere in topic names.
> 3. Tell people they need to be careful about this issue. Piss off the set
> of people who use both "_" and "." *and* happen to have conflicting topic
> names. They will have some pain when they discover the issue and have to
> figure out how to move one of those topics over to a non-conflicting name.
> I'm going to claim that this group must be an *extremely* small fraction of
> users, which doesn't make it better to allow things to break for them, but
> at least gives us an idea of the scale of impact.
>
> (One other alternative suggested earlier was encoding metric names to
> account for differences; given the metric renaming mess in the last
> release, I'm extremely hesitant to suggest anything of the sort...)
>
> None of the options are ideal, but to me, 3 seems like the least painful.
> Both for us, and for the vast majority of users. It seems to me that the
> number of users that would complain about (1) or (2) drastically outweigh
> (3).
>
> At this point, I don't think it's practical to keep switching the rules
> about which characters are allowed and which aren't because the previous
> attempts haven't been successful -- it seems the rules have changed
> multiple times, whether intentionally or accidentally, such that any more
> changes will cause problems. At this point, I think we just need to accept
> being liberal in accepting the range of topic names that have been
> permitted so far and make the best of the situation, even if it means only
> being able to warn people of conflicts.
>
> Here's another alternative: how about being liberal with topic name
> characters, but upon topic creation we convert the name to the metric name
> and fail if there's a conflict with another topic? This is relatively
> expensive (requires getting the metric name of all other topics), but it
> avoids the bad situation we're encountering here (conflicting metrics),
> avoids getting into a persistent conflict (we kill topic creation when we
> detect the issue rather than noticing it when the metrics conflict
> happens), and keeps the vast majority of existing users happy (both _ and .
> work in topic names as long as you don't create topics with conflicting
> metric names).
>
> There are definitely details to be worked out (auto topic creation?), but
> it seems like a more realistic solution than to start disallowing _ or . in
> topic names.
>
> -Ewen
>
>
> >
> > On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> > <ew...@confluent.io> wrote:
> > > I figure you'll probably see complaints no matter what change you make.
> > > Gwen, given that you raised this, another important question might be
> how
> > > many people you see using *both*. I'm guessing this question came up
> > > because you actually saw a conflict? But I'd imagine (or at least hope)
> > > that most organizations are mostly consistent about naming topics --
> they
> > > standardize on one or the other.
> > >
> > > Since there's no "right" way to name them, I'd just leave it supporting
> > > both and document the potential conflict in metrics. And if people use
> > both
> > > naming schemes, they probably deserve to suffer for their inconsistency
> > :)
> > >
> > > -Ewen
> > >
> > > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
> > wrote:
> > >
> > >> I find dots more common in my customer base, so I will definitely feel
> > >> the pain of removing them.
> > >>
> > >> However, "." are already used in metrics, file names, directories, etc
> > >> - so if we keep the dots, we need to keep code that translates them
> > >> and document the translation. Just banning "." seems more natural.
> > >> Also, as Grant mentioned, we'll probably have our own special usage
> > >> for "." down the line.
> > >>
> > >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >> > I absolutely disagree with #2, Neha. That will break a lot of
> > >> > infrastructure within LinkedIn. That said, removing "." might break
> > other
> > >> > people as well, but I think we should have a clearer idea of how
> much
> > >> usage
> > >> > there is on either side.
> > >> >
> > >> > -Todd
> > >> >
> > >> >
> > >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
> > >> wrote:
> > >> >
> > >> >> "." seems natural for grouping topic names. +1 for 2) going forward
> > only
> > >> >> without breaking previously created topics with "_" though that
> might
> > >> >> require us to patch the code somewhat awkwardly till we phase it
> out
> > a
> > >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> > versions
> > >> >> later.
> > >> >>
> > >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <
> gshapira@cloudera.com
> > >
> > >> >> wrote:
> > >> >>
> > >> >> > I don't think we should break existing topics. Just disallow new
> > >> >> > topics going forward.
> > >> >> >
> > >> >> > Agree that having both is horrible, but we should have a solution
> > that
> > >> >> > fails when you run "kafka_topics.sh --create", not when you
> > configure
> > >> >> > Ganglia.
> > >> >> >
> > >> >> > Gwen
> > >> >> >
> > >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> > wrote:
> > >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> > perverse,
> > >> >> but
> > >> >> > > people seem to do it. Breaking all the topics with '.' in the
> > name
> > >> >> seems
> > >> >> > > like it could be worse than combining metrics for people who
> > have a
> > >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> > >> perverse,
> > >> >> > > no?).
> > >> >> > >
> > >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> > >> >> > >
> > >> >> > > -Jay
> > >> >> > >
> > >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <
> tpalino@gmail.com>
> > >> >> wrote:
> > >> >> > >
> > >> >> > >> My selfish point of view is that we do #1, as we use "_"
> > >> extensively
> > >> >> in
> > >> >> > >> topic names here :) I also happen to think it's the right
> > choice,
> > >> >> > >> specifically because "." has more special meanings, as you
> > noted.
> > >> >> > >>
> > >> >> > >> -Todd
> > >> >> > >>
> > >> >> > >>
> > >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> > >> gshapira@cloudera.com>
> > >> >> > >> wrote:
> > >> >> > >>
> > >> >> > >> > Unintentional side effect from allowing IP addresses in
> > consumer
> > >> >> > client
> > >> >> > >> > IDs :)
> > >> >> > >> >
> > >> >> > >> > So the question is, what do we do now?
> > >> >> > >> >
> > >> >> > >> > 1) disallow "."
> > >> >> > >> > 2) disallow "_"
> > >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
> > break
> > >> >> > existing
> > >> >> > >> > metrics
> > >> >> > >> > 4) all of the above?
> > >> >> > >> >
> > >> >> > >> > btw. it looks like "." and ".." are currently valid. Topic
> > names
> > >> are
> > >> >> > >> > used for directories, right? this sounds like fun :)
> > >> >> > >> >
> > >> >> > >> > I vote for option #1, although if someone has a good idea
> for
> > #3
> > >> it
> > >> >> > >> > will be even better.
> > >> >> > >> >
> > >> >> > >> > Gwen
> > >> >> > >> >
> > >> >> > >> >
> > >> >> > >> >
> > >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> > >> ghenke@cloudera.com>
> > >> >> > >> wrote:
> > >> >> > >> > > Found it was added here:
> > >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > >> >> > >> > >
> > >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> > >> tpalino@gmail.com>
> > >> >> > >> wrote:
> > >> >> > >> > >
> > >> >> > >> > >> This was definitely changed at some point after
> KAFKA-495.
> > The
> > >> >> > >> question
> > >> >> > >> > is
> > >> >> > >> > >> when and why.
> > >> >> > >> > >>
> > >> >> > >> > >> Here's the relevant code from that patch:
> > >> >> > >> > >>
> > >> >> > >> > >>
> > >> >> ===================================================================
> > >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> > >> >> 1390178)
> > >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working
> > copy)
> > >> >> > >> > >> @@ -21,24 +21,21 @@
> > >> >> > >> > >>  import util.matching.Regex
> > >> >> > >> > >>
> > >> >> > >> > >>  object Topic {
> > >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >> >> > >> > >>
> > >> >> > >> > >>
> > >> >> > >> > >>
> > >> >> > >> > >> -Todd
> > >> >> > >> > >>
> > >> >> > >> > >>
> > >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> > >> >> ghenke@cloudera.com>
> > >> >> > >> > wrote:
> > >> >> > >> > >>
> > >> >> > >> > >> > kafka.common.Topic shows that currently period is a
> valid
> > >> >> > character
> > >> >> > >> > and I
> > >> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
> > >> topic
> > >> >> > with a
> > >> >> > >> > >> > period.
> > >> >> > >> > >> >
> > >> >> > >> > >> >
> > >> >> > >> > >> >
> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > >> >> > currently
> > >> >> > >> > uses
> > >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> > >> >> > >> > >> >
> > >> >> > >> > >> > Should period character support be removed? I was under
> > the
> > >> >> same
> > >> >> > >> > >> impression
> > >> >> > >> > >> > as Gwen, that a period was used by many as a way to
> > "group"
> > >> >> > topics.
> > >> >> > >> > >> >
> > >> >> > >> > >> > The code is pasted below since its small:
> > >> >> > >> > >> >
> > >> >> > >> > >> > object Topic {
> > >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> >> > >> > >> >   private val maxNameLength = 255
> > >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> >> > >> > >> >
> > >> >> > >> > >> >   val InternalTopics =
> > Set(OffsetManager.OffsetsTopicName)
> > >> >> > >> > >> >
> > >> >> > >> > >> >   def validate(topic: String) {
> > >> >> > >> > >> >     if (topic.length <= 0)
> > >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> > >> illegal,
> > >> >> > can't
> > >> >> > >> be
> > >> >> > >> > >> > empty")
> > >> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> > >> >> > >> > >> >       throw new InvalidTopicException("topic name
> cannot
> > be
> > >> >> > \".\" or
> > >> >> > >> > >> > \"..\"")
> > >> >> > >> > >> >     else if (topic.length > maxNameLength)
> > >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> > >> illegal,
> > >> >> > can't
> > >> >> > >> be
> > >> >> > >> > >> > longer than " + maxNameLength + " characters")
> > >> >> > >> > >> >
> > >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> > >> >> > >> > >> >       case Some(t) =>
> > >> >> > >> > >> >         if (!t.equals(topic))
> > >> >> > >> > >> >           throw new InvalidTopicException("topic name
> " +
> > >> topic
> > >> >> > + "
> > >> >> > >> is
> > >> >> > >> > >> > illegal, contains a character other than ASCII
> > >> alphanumerics,
> > >> >> > '.',
> > >> >> > >> '_'
> > >> >> > >> > >> and
> > >> >> > >> > >> > '-'")
> > >> >> > >> > >> >       case None => throw new
> InvalidTopicException("topic
> > >> name
> > >> >> "
> > >> >> > +
> > >> >> > >> > topic
> > >> >> > >> > >> +
> > >> >> > >> > >> > " is illegal,  contains a character other than ASCII
> > >> >> > alphanumerics,
> > >> >> > >> > '.',
> > >> >> > >> > >> > '_' and '-'")
> > >> >> > >> > >> >     }
> > >> >> > >> > >> >   }
> > >> >> > >> > >> > }
> > >> >> > >> > >> >
> > >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> > >> >> tpalino@gmail.com>
> > >> >> > >> > wrote:
> > >> >> > >> > >> >
> > >> >> > >> > >> > > I had to go look this one up again to make sure -
> > >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> >> > >> > >> > >
> > >> >> > >> > >> > > The only valid character names for topics are
> > >> alphanumeric,
> > >> >> > >> > underscore,
> > >> >> > >> > >> > and
> > >> >> > >> > >> > > dash. A period is not supposed to be a valid
> character
> > to
> > >> >> use.
> > >> >> > If
> > >> >> > >> > >> you're
> > >> >> > >> > >> > > seeing them, then one of two things have happened:
> > >> >> > >> > >> > >
> > >> >> > >> > >> > > 1) You have topic names that are grandfathered in
> from
> > >> before
> > >> >> > that
> > >> >> > >> > >> patch
> > >> >> > >> > >> > > 2) The patch is not working properly and there is
> > >> somewhere
> > >> >> in
> > >> >> > the
> > >> >> > >> > >> broker
> > >> >> > >> > >> > > that the standard is not being enforced.
> > >> >> > >> > >> > >
> > >> >> > >> > >> > > -Todd
> > >> >> > >> > >> > >
> > >> >> > >> > >> > >
> > >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > >> >> > brock@apache.org>
> > >> >> > >> > >> wrote:
> > >> >> > >> > >> > >
> > >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> >> > >> > >> gshapira@cloudera.com>
> > >> >> > >> > >> > > > wrote:
> > >> >> > >> > >> > > > > Hi Kafka Fans,
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
> > >> other
> > >> >> > named
> > >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
> > named
> > >> >> > >> kafka_lab_2
> > >> >> > >> > >> for
> > >> >> > >> > >> > > > > both, effectively making it impossible to monitor
> > them
> > >> >> > >> properly.
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > The reason this happens is that using "." in
> topic
> > >> names
> > >> >> is
> > >> >> > >> > pretty
> > >> >> > >> > >> > > > > common, especially as a way to group topics into
> > data
> > >> >> > centers,
> > >> >> > >> > >> > > > > relevant apps, etc - basically a work-around to
> our
> > >> >> current
> > >> >> > >> > lack of
> > >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> > systems
> > >> >> using
> > >> >> > "."
> > >> >> > >> > to
> > >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
> > metric
> > >> >> names,
> > >> >> > >> > Kafka
> > >> >> > >> > >> > > > > replaces the "." in the name with an underscore.
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > This generates good metric names, but creates the
> > >> problem
> > >> >> > with
> > >> >> > >> > name
> > >> >> > >> > >> > > > collisions.
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit
> the
> > >> range
> > >> >> > of
> > >> >> > >> > >> > > > > characters permitted in a topic name and disallow
> > "_"?
> > >> >> > >> Obviously
> > >> >> > >> > >> > > > > existing topics will need to remain as is, which
> > is a
> > >> bit
> > >> >> > >> > awkward.
> > >> >> > >> > >> > > >
> > >> >> > >> > >> > > > Interesting problem! Many if not most users I
> > >> personally am
> > >> >> > >> aware
> > >> >> > >> > of
> > >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure
> that
> > >> many
> > >> >> > users
> > >> >> > >> > >> would
> > >> >> > >> > >> > > > be quite surprised by this limitation. With that
> > said,
> > >> I am
> > >> >> > sure
> > >> >> > >> > >> > > > they'd transition accordingly.
> > >> >> > >> > >> > > >
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > If anyone has better backward-compatible
> solutions
> > to
> > >> >> this,
> > >> >> > >> I'm
> > >> >> > >> > all
> > >> >> > >> > >> > > ears
> > >> >> > >> > >> > > > :)
> > >> >> > >> > >> > > > >
> > >> >> > >> > >> > > > > Gwen
> > >> >> > >> > >> > > >
> > >> >> > >> > >> > >
> > >> >> > >> > >> >
> > >> >> > >> > >> >
> > >> >> > >> > >> >
> > >> >> > >> > >> > --
> > >> >> > >> > >> > Grant Henke
> > >> >> > >> > >> > Solutions Consultant | Cloudera
> > >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > >> >> > >> > linkedin.com/in/granthenke
> > >> >> > >> > >> >
> > >> >> > >> > >>
> > >> >> > >> > >
> > >> >> > >> > >
> > >> >> > >> > >
> > >> >> > >> > > --
> > >> >> > >> > > Grant Henke
> > >> >> > >> > > Solutions Consultant | Cloudera
> > >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > >> >> > linkedin.com/in/granthenke
> > >> >> > >> >
> > >> >> > >>
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Thanks,
> > >> >> Neha
> > >> >>
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> >
>
>
>
> --
> Thanks,
> Ewen
>



-- 
-- Guozhang

Re: [Discussion] Limitations on topic names

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
On Fri, Jul 10, 2015 at 4:41 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> Yeah, I have an actual customer who ran into this. Unfortunately,
> inconsistencies in the way things are named are pretty common - just
> look at Kafka's many CLI options.
>
> I don't think that supporting both and pointing at the docs with "I
> told you so" when our metrics break is a good solution.
>

I agree, especially since we don't *already* have something in the docs
indicating this will be an issue. I was flippant about the situation
because I *wish* there was more careful consideration + naming policy in
place, but I realize that doesn't always happen in practice. I guess I need
to take Compatibility Czar more seriously :)

I see think the obvious practical options are as follows:

1. Kill support for "_". Piss off the entire set of people who currently
use "_" anywhere in topic names.
2. Kill support for ".". Piss off the entire set of people who currently
use "." anywhere in topic names.
3. Tell people they need to be careful about this issue. Piss off the set
of people who use both "_" and "." *and* happen to have conflicting topic
names. They will have some pain when they discover the issue and have to
figure out how to move one of those topics over to a non-conflicting name.
I'm going to claim that this group must be an *extremely* small fraction of
users, which doesn't make it better to allow things to break for them, but
at least gives us an idea of the scale of impact.

(One other alternative suggested earlier was encoding metric names to
account for differences; given the metric renaming mess in the last
release, I'm extremely hesitant to suggest anything of the sort...)

None of the options are ideal, but to me, 3 seems like the least painful.
Both for us, and for the vast majority of users. It seems to me that the
number of users that would complain about (1) or (2) drastically outweigh
(3).

At this point, I don't think it's practical to keep switching the rules
about which characters are allowed and which aren't because the previous
attempts haven't been successful -- it seems the rules have changed
multiple times, whether intentionally or accidentally, such that any more
changes will cause problems. At this point, I think we just need to accept
being liberal in accepting the range of topic names that have been
permitted so far and make the best of the situation, even if it means only
being able to warn people of conflicts.

Here's another alternative: how about being liberal with topic name
characters, but upon topic creation we convert the name to the metric name
and fail if there's a conflict with another topic? This is relatively
expensive (requires getting the metric name of all other topics), but it
avoids the bad situation we're encountering here (conflicting metrics),
avoids getting into a persistent conflict (we kill topic creation when we
detect the issue rather than noticing it when the metrics conflict
happens), and keeps the vast majority of existing users happy (both _ and .
work in topic names as long as you don't create topics with conflicting
metric names).

There are definitely details to be worked out (auto topic creation?), but
it seems like a more realistic solution than to start disallowing _ or . in
topic names.

-Ewen


>
> On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
> <ew...@confluent.io> wrote:
> > I figure you'll probably see complaints no matter what change you make.
> > Gwen, given that you raised this, another important question might be how
> > many people you see using *both*. I'm guessing this question came up
> > because you actually saw a conflict? But I'd imagine (or at least hope)
> > that most organizations are mostly consistent about naming topics -- they
> > standardize on one or the other.
> >
> > Since there's no "right" way to name them, I'd just leave it supporting
> > both and document the potential conflict in metrics. And if people use
> both
> > naming schemes, they probably deserve to suffer for their inconsistency
> :)
> >
> > -Ewen
> >
> > On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
> >
> >> I find dots more common in my customer base, so I will definitely feel
> >> the pain of removing them.
> >>
> >> However, "." are already used in metrics, file names, directories, etc
> >> - so if we keep the dots, we need to keep code that translates them
> >> and document the translation. Just banning "." seems more natural.
> >> Also, as Grant mentioned, we'll probably have our own special usage
> >> for "." down the line.
> >>
> >> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
> >> > I absolutely disagree with #2, Neha. That will break a lot of
> >> > infrastructure within LinkedIn. That said, removing "." might break
> other
> >> > people as well, but I think we should have a clearer idea of how much
> >> usage
> >> > there is on either side.
> >> >
> >> > -Todd
> >> >
> >> >
> >> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
> >> wrote:
> >> >
> >> >> "." seems natural for grouping topic names. +1 for 2) going forward
> only
> >> >> without breaking previously created topics with "_" though that might
> >> >> require us to patch the code somewhat awkwardly till we phase it out
> a
> >> >> couple (purposely left vague to stay out of Ewen's wrath :-))
> versions
> >> >> later.
> >> >>
> >> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gshapira@cloudera.com
> >
> >> >> wrote:
> >> >>
> >> >> > I don't think we should break existing topics. Just disallow new
> >> >> > topics going forward.
> >> >> >
> >> >> > Agree that having both is horrible, but we should have a solution
> that
> >> >> > fails when you run "kafka_topics.sh --create", not when you
> configure
> >> >> > Ganglia.
> >> >> >
> >> >> > Gwen
> >> >> >
> >> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io>
> wrote:
> >> >> > > Unfortunately '.' is pretty common too. I agree that it is
> perverse,
> >> >> but
> >> >> > > people seem to do it. Breaking all the topics with '.' in the
> name
> >> >> seems
> >> >> > > like it could be worse than combining metrics for people who
> have a
> >> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> >> perverse,
> >> >> > > no?).
> >> >> > >
> >> >> > > Where is our Dean of Compatibility, Ewen, on this?
> >> >> > >
> >> >> > > -Jay
> >> >> > >
> >> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
> >> >> wrote:
> >> >> > >
> >> >> > >> My selfish point of view is that we do #1, as we use "_"
> >> extensively
> >> >> in
> >> >> > >> topic names here :) I also happen to think it's the right
> choice,
> >> >> > >> specifically because "." has more special meanings, as you
> noted.
> >> >> > >>
> >> >> > >> -Todd
> >> >> > >>
> >> >> > >>
> >> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> >> gshapira@cloudera.com>
> >> >> > >> wrote:
> >> >> > >>
> >> >> > >> > Unintentional side effect from allowing IP addresses in
> consumer
> >> >> > client
> >> >> > >> > IDs :)
> >> >> > >> >
> >> >> > >> > So the question is, what do we do now?
> >> >> > >> >
> >> >> > >> > 1) disallow "."
> >> >> > >> > 2) disallow "_"
> >> >> > >> > 3) find a reversible way to encode "." and "_" that won't
> break
> >> >> > existing
> >> >> > >> > metrics
> >> >> > >> > 4) all of the above?
> >> >> > >> >
> >> >> > >> > btw. it looks like "." and ".." are currently valid. Topic
> names
> >> are
> >> >> > >> > used for directories, right? this sounds like fun :)
> >> >> > >> >
> >> >> > >> > I vote for option #1, although if someone has a good idea for
> #3
> >> it
> >> >> > >> > will be even better.
> >> >> > >> >
> >> >> > >> > Gwen
> >> >> > >> >
> >> >> > >> >
> >> >> > >> >
> >> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> >> ghenke@cloudera.com>
> >> >> > >> wrote:
> >> >> > >> > > Found it was added here:
> >> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> >> >> > >> > >
> >> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> >> tpalino@gmail.com>
> >> >> > >> wrote:
> >> >> > >> > >
> >> >> > >> > >> This was definitely changed at some point after KAFKA-495.
> The
> >> >> > >> question
> >> >> > >> > is
> >> >> > >> > >> when and why.
> >> >> > >> > >>
> >> >> > >> > >> Here's the relevant code from that patch:
> >> >> > >> > >>
> >> >> > >> > >>
> >> >> ===================================================================
> >> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> >> >> 1390178)
> >> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working
> copy)
> >> >> > >> > >> @@ -21,24 +21,21 @@
> >> >> > >> > >>  import util.matching.Regex
> >> >> > >> > >>
> >> >> > >> > >>  object Topic {
> >> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> >> >> > >> > >>
> >> >> > >> > >>
> >> >> > >> > >>
> >> >> > >> > >> -Todd
> >> >> > >> > >>
> >> >> > >> > >>
> >> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> >> >> ghenke@cloudera.com>
> >> >> > >> > wrote:
> >> >> > >> > >>
> >> >> > >> > >> > kafka.common.Topic shows that currently period is a valid
> >> >> > character
> >> >> > >> > and I
> >> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
> >> topic
> >> >> > with a
> >> >> > >> > >> > period.
> >> >> > >> > >> >
> >> >> > >> > >> >
> >> >> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> >> >> > currently
> >> >> > >> > uses
> >> >> > >> > >> > Topic.validate before writing to Zookeeper.
> >> >> > >> > >> >
> >> >> > >> > >> > Should period character support be removed? I was under
> the
> >> >> same
> >> >> > >> > >> impression
> >> >> > >> > >> > as Gwen, that a period was used by many as a way to
> "group"
> >> >> > topics.
> >> >> > >> > >> >
> >> >> > >> > >> > The code is pasted below since its small:
> >> >> > >> > >> >
> >> >> > >> > >> > object Topic {
> >> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> >> > >> > >> >   private val maxNameLength = 255
> >> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> >> >> > >> > >> >
> >> >> > >> > >> >   val InternalTopics =
> Set(OffsetManager.OffsetsTopicName)
> >> >> > >> > >> >
> >> >> > >> > >> >   def validate(topic: String) {
> >> >> > >> > >> >     if (topic.length <= 0)
> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> >> illegal,
> >> >> > can't
> >> >> > >> be
> >> >> > >> > >> > empty")
> >> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> >> >> > >> > >> >       throw new InvalidTopicException("topic name cannot
> be
> >> >> > \".\" or
> >> >> > >> > >> > \"..\"")
> >> >> > >> > >> >     else if (topic.length > maxNameLength)
> >> >> > >> > >> >       throw new InvalidTopicException("topic name is
> >> illegal,
> >> >> > can't
> >> >> > >> be
> >> >> > >> > >> > longer than " + maxNameLength + " characters")
> >> >> > >> > >> >
> >> >> > >> > >> >     rgx.findFirstIn(topic) match {
> >> >> > >> > >> >       case Some(t) =>
> >> >> > >> > >> >         if (!t.equals(topic))
> >> >> > >> > >> >           throw new InvalidTopicException("topic name " +
> >> topic
> >> >> > + "
> >> >> > >> is
> >> >> > >> > >> > illegal, contains a character other than ASCII
> >> alphanumerics,
> >> >> > '.',
> >> >> > >> '_'
> >> >> > >> > >> and
> >> >> > >> > >> > '-'")
> >> >> > >> > >> >       case None => throw new InvalidTopicException("topic
> >> name
> >> >> "
> >> >> > +
> >> >> > >> > topic
> >> >> > >> > >> +
> >> >> > >> > >> > " is illegal,  contains a character other than ASCII
> >> >> > alphanumerics,
> >> >> > >> > '.',
> >> >> > >> > >> > '_' and '-'")
> >> >> > >> > >> >     }
> >> >> > >> > >> >   }
> >> >> > >> > >> > }
> >> >> > >> > >> >
> >> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> >> >> tpalino@gmail.com>
> >> >> > >> > wrote:
> >> >> > >> > >> >
> >> >> > >> > >> > > I had to go look this one up again to make sure -
> >> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> >> > >> > >> > >
> >> >> > >> > >> > > The only valid character names for topics are
> >> alphanumeric,
> >> >> > >> > underscore,
> >> >> > >> > >> > and
> >> >> > >> > >> > > dash. A period is not supposed to be a valid character
> to
> >> >> use.
> >> >> > If
> >> >> > >> > >> you're
> >> >> > >> > >> > > seeing them, then one of two things have happened:
> >> >> > >> > >> > >
> >> >> > >> > >> > > 1) You have topic names that are grandfathered in from
> >> before
> >> >> > that
> >> >> > >> > >> patch
> >> >> > >> > >> > > 2) The patch is not working properly and there is
> >> somewhere
> >> >> in
> >> >> > the
> >> >> > >> > >> broker
> >> >> > >> > >> > > that the standard is not being enforced.
> >> >> > >> > >> > >
> >> >> > >> > >> > > -Todd
> >> >> > >> > >> > >
> >> >> > >> > >> > >
> >> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> >> >> > brock@apache.org>
> >> >> > >> > >> wrote:
> >> >> > >> > >> > >
> >> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> >> > >> > >> gshapira@cloudera.com>
> >> >> > >> > >> > > > wrote:
> >> >> > >> > >> > > > > Hi Kafka Fans,
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
> >> other
> >> >> > named
> >> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be
> named
> >> >> > >> kafka_lab_2
> >> >> > >> > >> for
> >> >> > >> > >> > > > > both, effectively making it impossible to monitor
> them
> >> >> > >> properly.
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > The reason this happens is that using "." in topic
> >> names
> >> >> is
> >> >> > >> > pretty
> >> >> > >> > >> > > > > common, especially as a way to group topics into
> data
> >> >> > centers,
> >> >> > >> > >> > > > > relevant apps, etc - basically a work-around to our
> >> >> current
> >> >> > >> > lack of
> >> >> > >> > >> > > > > name spaces. However, most metric monitoring
> systems
> >> >> using
> >> >> > "."
> >> >> > >> > to
> >> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around
> metric
> >> >> names,
> >> >> > >> > Kafka
> >> >> > >> > >> > > > > replaces the "." in the name with an underscore.
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > This generates good metric names, but creates the
> >> problem
> >> >> > with
> >> >> > >> > name
> >> >> > >> > >> > > > collisions.
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit the
> >> range
> >> >> > of
> >> >> > >> > >> > > > > characters permitted in a topic name and disallow
> "_"?
> >> >> > >> Obviously
> >> >> > >> > >> > > > > existing topics will need to remain as is, which
> is a
> >> bit
> >> >> > >> > awkward.
> >> >> > >> > >> > > >
> >> >> > >> > >> > > > Interesting problem! Many if not most users I
> >> personally am
> >> >> > >> aware
> >> >> > >> > of
> >> >> > >> > >> > > > use "_" as a separator in topic names. I am sure that
> >> many
> >> >> > users
> >> >> > >> > >> would
> >> >> > >> > >> > > > be quite surprised by this limitation. With that
> said,
> >> I am
> >> >> > sure
> >> >> > >> > >> > > > they'd transition accordingly.
> >> >> > >> > >> > > >
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > If anyone has better backward-compatible solutions
> to
> >> >> this,
> >> >> > >> I'm
> >> >> > >> > all
> >> >> > >> > >> > > ears
> >> >> > >> > >> > > > :)
> >> >> > >> > >> > > > >
> >> >> > >> > >> > > > > Gwen
> >> >> > >> > >> > > >
> >> >> > >> > >> > >
> >> >> > >> > >> >
> >> >> > >> > >> >
> >> >> > >> > >> >
> >> >> > >> > >> > --
> >> >> > >> > >> > Grant Henke
> >> >> > >> > >> > Solutions Consultant | Cloudera
> >> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> >> >> > >> > linkedin.com/in/granthenke
> >> >> > >> > >> >
> >> >> > >> > >>
> >> >> > >> > >
> >> >> > >> > >
> >> >> > >> > >
> >> >> > >> > > --
> >> >> > >> > > Grant Henke
> >> >> > >> > > Solutions Consultant | Cloudera
> >> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> >> >> > linkedin.com/in/granthenke
> >> >> > >> >
> >> >> > >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks,
> >> >> Neha
> >> >>
> >>
> >
> >
> >
> > --
> > Thanks,
> > Ewen
>



-- 
Thanks,
Ewen

Re: [Discussion] Limitations on topic names

Posted by Gwen Shapira <gs...@cloudera.com>.
Yeah, I have an actual customer who ran into this. Unfortunately,
inconsistencies in the way things are named are pretty common - just
look at Kafka's many CLI options.

I don't think that supporting both and pointing at the docs with "I
told you so" when our metrics break is a good solution.

On Fri, Jul 10, 2015 at 4:33 PM, Ewen Cheslack-Postava
<ew...@confluent.io> wrote:
> I figure you'll probably see complaints no matter what change you make.
> Gwen, given that you raised this, another important question might be how
> many people you see using *both*. I'm guessing this question came up
> because you actually saw a conflict? But I'd imagine (or at least hope)
> that most organizations are mostly consistent about naming topics -- they
> standardize on one or the other.
>
> Since there's no "right" way to name them, I'd just leave it supporting
> both and document the potential conflict in metrics. And if people use both
> naming schemes, they probably deserve to suffer for their inconsistency :)
>
> -Ewen
>
> On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com> wrote:
>
>> I find dots more common in my customer base, so I will definitely feel
>> the pain of removing them.
>>
>> However, "." are already used in metrics, file names, directories, etc
>> - so if we keep the dots, we need to keep code that translates them
>> and document the translation. Just banning "." seems more natural.
>> Also, as Grant mentioned, we'll probably have our own special usage
>> for "." down the line.
>>
>> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
>> > I absolutely disagree with #2, Neha. That will break a lot of
>> > infrastructure within LinkedIn. That said, removing "." might break other
>> > people as well, but I think we should have a clearer idea of how much
>> usage
>> > there is on either side.
>> >
>> > -Todd
>> >
>> >
>> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
>> wrote:
>> >
>> >> "." seems natural for grouping topic names. +1 for 2) going forward only
>> >> without breaking previously created topics with "_" though that might
>> >> require us to patch the code somewhat awkwardly till we phase it out a
>> >> couple (purposely left vague to stay out of Ewen's wrath :-)) versions
>> >> later.
>> >>
>> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com>
>> >> wrote:
>> >>
>> >> > I don't think we should break existing topics. Just disallow new
>> >> > topics going forward.
>> >> >
>> >> > Agree that having both is horrible, but we should have a solution that
>> >> > fails when you run "kafka_topics.sh --create", not when you configure
>> >> > Ganglia.
>> >> >
>> >> > Gwen
>> >> >
>> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
>> >> > > Unfortunately '.' is pretty common too. I agree that it is perverse,
>> >> but
>> >> > > people seem to do it. Breaking all the topics with '.' in the name
>> >> seems
>> >> > > like it could be worse than combining metrics for people who have a
>> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
>> perverse,
>> >> > > no?).
>> >> > >
>> >> > > Where is our Dean of Compatibility, Ewen, on this?
>> >> > >
>> >> > > -Jay
>> >> > >
>> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>> >> wrote:
>> >> > >
>> >> > >> My selfish point of view is that we do #1, as we use "_"
>> extensively
>> >> in
>> >> > >> topic names here :) I also happen to think it's the right choice,
>> >> > >> specifically because "." has more special meanings, as you noted.
>> >> > >>
>> >> > >> -Todd
>> >> > >>
>> >> > >>
>> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
>> gshapira@cloudera.com>
>> >> > >> wrote:
>> >> > >>
>> >> > >> > Unintentional side effect from allowing IP addresses in consumer
>> >> > client
>> >> > >> > IDs :)
>> >> > >> >
>> >> > >> > So the question is, what do we do now?
>> >> > >> >
>> >> > >> > 1) disallow "."
>> >> > >> > 2) disallow "_"
>> >> > >> > 3) find a reversible way to encode "." and "_" that won't break
>> >> > existing
>> >> > >> > metrics
>> >> > >> > 4) all of the above?
>> >> > >> >
>> >> > >> > btw. it looks like "." and ".." are currently valid. Topic names
>> are
>> >> > >> > used for directories, right? this sounds like fun :)
>> >> > >> >
>> >> > >> > I vote for option #1, although if someone has a good idea for #3
>> it
>> >> > >> > will be even better.
>> >> > >> >
>> >> > >> > Gwen
>> >> > >> >
>> >> > >> >
>> >> > >> >
>> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
>> ghenke@cloudera.com>
>> >> > >> wrote:
>> >> > >> > > Found it was added here:
>> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
>> >> > >> > >
>> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
>> tpalino@gmail.com>
>> >> > >> wrote:
>> >> > >> > >
>> >> > >> > >> This was definitely changed at some point after KAFKA-495. The
>> >> > >> question
>> >> > >> > is
>> >> > >> > >> when and why.
>> >> > >> > >>
>> >> > >> > >> Here's the relevant code from that patch:
>> >> > >> > >>
>> >> > >> > >>
>> >> ===================================================================
>> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>> >> 1390178)
>> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
>> >> > >> > >> @@ -21,24 +21,21 @@
>> >> > >> > >>  import util.matching.Regex
>> >> > >> > >>
>> >> > >> > >>  object Topic {
>> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >> -Todd
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>> >> ghenke@cloudera.com>
>> >> > >> > wrote:
>> >> > >> > >>
>> >> > >> > >> > kafka.common.Topic shows that currently period is a valid
>> >> > character
>> >> > >> > and I
>> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
>> topic
>> >> > with a
>> >> > >> > >> > period.
>> >> > >> > >> >
>> >> > >> > >> >
>> >> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>> >> > currently
>> >> > >> > uses
>> >> > >> > >> > Topic.validate before writing to Zookeeper.
>> >> > >> > >> >
>> >> > >> > >> > Should period character support be removed? I was under the
>> >> same
>> >> > >> > >> impression
>> >> > >> > >> > as Gwen, that a period was used by many as a way to "group"
>> >> > topics.
>> >> > >> > >> >
>> >> > >> > >> > The code is pasted below since its small:
>> >> > >> > >> >
>> >> > >> > >> > object Topic {
>> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> >> > >> > >> >   private val maxNameLength = 255
>> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
>> >> > >> > >> >
>> >> > >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
>> >> > >> > >> >
>> >> > >> > >> >   def validate(topic: String) {
>> >> > >> > >> >     if (topic.length <= 0)
>> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> illegal,
>> >> > can't
>> >> > >> be
>> >> > >> > >> > empty")
>> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
>> >> > >> > >> >       throw new InvalidTopicException("topic name cannot be
>> >> > \".\" or
>> >> > >> > >> > \"..\"")
>> >> > >> > >> >     else if (topic.length > maxNameLength)
>> >> > >> > >> >       throw new InvalidTopicException("topic name is
>> illegal,
>> >> > can't
>> >> > >> be
>> >> > >> > >> > longer than " + maxNameLength + " characters")
>> >> > >> > >> >
>> >> > >> > >> >     rgx.findFirstIn(topic) match {
>> >> > >> > >> >       case Some(t) =>
>> >> > >> > >> >         if (!t.equals(topic))
>> >> > >> > >> >           throw new InvalidTopicException("topic name " +
>> topic
>> >> > + "
>> >> > >> is
>> >> > >> > >> > illegal, contains a character other than ASCII
>> alphanumerics,
>> >> > '.',
>> >> > >> '_'
>> >> > >> > >> and
>> >> > >> > >> > '-'")
>> >> > >> > >> >       case None => throw new InvalidTopicException("topic
>> name
>> >> "
>> >> > +
>> >> > >> > topic
>> >> > >> > >> +
>> >> > >> > >> > " is illegal,  contains a character other than ASCII
>> >> > alphanumerics,
>> >> > >> > '.',
>> >> > >> > >> > '_' and '-'")
>> >> > >> > >> >     }
>> >> > >> > >> >   }
>> >> > >> > >> > }
>> >> > >> > >> >
>> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>> >> tpalino@gmail.com>
>> >> > >> > wrote:
>> >> > >> > >> >
>> >> > >> > >> > > I had to go look this one up again to make sure -
>> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
>> >> > >> > >> > >
>> >> > >> > >> > > The only valid character names for topics are
>> alphanumeric,
>> >> > >> > underscore,
>> >> > >> > >> > and
>> >> > >> > >> > > dash. A period is not supposed to be a valid character to
>> >> use.
>> >> > If
>> >> > >> > >> you're
>> >> > >> > >> > > seeing them, then one of two things have happened:
>> >> > >> > >> > >
>> >> > >> > >> > > 1) You have topic names that are grandfathered in from
>> before
>> >> > that
>> >> > >> > >> patch
>> >> > >> > >> > > 2) The patch is not working properly and there is
>> somewhere
>> >> in
>> >> > the
>> >> > >> > >> broker
>> >> > >> > >> > > that the standard is not being enforced.
>> >> > >> > >> > >
>> >> > >> > >> > > -Todd
>> >> > >> > >> > >
>> >> > >> > >> > >
>> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>> >> > brock@apache.org>
>> >> > >> > >> wrote:
>> >> > >> > >> > >
>> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> >> > >> > >> gshapira@cloudera.com>
>> >> > >> > >> > > > wrote:
>> >> > >> > >> > > > > Hi Kafka Fans,
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
>> other
>> >> > named
>> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
>> >> > >> kafka_lab_2
>> >> > >> > >> for
>> >> > >> > >> > > > > both, effectively making it impossible to monitor them
>> >> > >> properly.
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > The reason this happens is that using "." in topic
>> names
>> >> is
>> >> > >> > pretty
>> >> > >> > >> > > > > common, especially as a way to group topics into data
>> >> > centers,
>> >> > >> > >> > > > > relevant apps, etc - basically a work-around to our
>> >> current
>> >> > >> > lack of
>> >> > >> > >> > > > > name spaces. However, most metric monitoring systems
>> >> using
>> >> > "."
>> >> > >> > to
>> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around metric
>> >> names,
>> >> > >> > Kafka
>> >> > >> > >> > > > > replaces the "." in the name with an underscore.
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > This generates good metric names, but creates the
>> problem
>> >> > with
>> >> > >> > name
>> >> > >> > >> > > > collisions.
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit the
>> range
>> >> > of
>> >> > >> > >> > > > > characters permitted in a topic name and disallow "_"?
>> >> > >> Obviously
>> >> > >> > >> > > > > existing topics will need to remain as is, which is a
>> bit
>> >> > >> > awkward.
>> >> > >> > >> > > >
>> >> > >> > >> > > > Interesting problem! Many if not most users I
>> personally am
>> >> > >> aware
>> >> > >> > of
>> >> > >> > >> > > > use "_" as a separator in topic names. I am sure that
>> many
>> >> > users
>> >> > >> > >> would
>> >> > >> > >> > > > be quite surprised by this limitation. With that said,
>> I am
>> >> > sure
>> >> > >> > >> > > > they'd transition accordingly.
>> >> > >> > >> > > >
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > If anyone has better backward-compatible solutions to
>> >> this,
>> >> > >> I'm
>> >> > >> > all
>> >> > >> > >> > > ears
>> >> > >> > >> > > > :)
>> >> > >> > >> > > > >
>> >> > >> > >> > > > > Gwen
>> >> > >> > >> > > >
>> >> > >> > >> > >
>> >> > >> > >> >
>> >> > >> > >> >
>> >> > >> > >> >
>> >> > >> > >> > --
>> >> > >> > >> > Grant Henke
>> >> > >> > >> > Solutions Consultant | Cloudera
>> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
>> >> > >> > linkedin.com/in/granthenke
>> >> > >> > >> >
>> >> > >> > >>
>> >> > >> > >
>> >> > >> > >
>> >> > >> > >
>> >> > >> > > --
>> >> > >> > > Grant Henke
>> >> > >> > > Solutions Consultant | Cloudera
>> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
>> >> > linkedin.com/in/granthenke
>> >> > >> >
>> >> > >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks,
>> >> Neha
>> >>
>>
>
>
>
> --
> Thanks,
> Ewen

Re: [Discussion] Limitations on topic names

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
I figure you'll probably see complaints no matter what change you make.
Gwen, given that you raised this, another important question might be how
many people you see using *both*. I'm guessing this question came up
because you actually saw a conflict? But I'd imagine (or at least hope)
that most organizations are mostly consistent about naming topics -- they
standardize on one or the other.

Since there's no "right" way to name them, I'd just leave it supporting
both and document the potential conflict in metrics. And if people use both
naming schemes, they probably deserve to suffer for their inconsistency :)

-Ewen

On Fri, Jul 10, 2015 at 3:28 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> I find dots more common in my customer base, so I will definitely feel
> the pain of removing them.
>
> However, "." are already used in metrics, file names, directories, etc
> - so if we keep the dots, we need to keep code that translates them
> and document the translation. Just banning "." seems more natural.
> Also, as Grant mentioned, we'll probably have our own special usage
> for "." down the line.
>
> On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
> > I absolutely disagree with #2, Neha. That will break a lot of
> > infrastructure within LinkedIn. That said, removing "." might break other
> > people as well, but I think we should have a clearer idea of how much
> usage
> > there is on either side.
> >
> > -Todd
> >
> >
> > On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io>
> wrote:
> >
> >> "." seems natural for grouping topic names. +1 for 2) going forward only
> >> without breaking previously created topics with "_" though that might
> >> require us to patch the code somewhat awkwardly till we phase it out a
> >> couple (purposely left vague to stay out of Ewen's wrath :-)) versions
> >> later.
> >>
> >> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com>
> >> wrote:
> >>
> >> > I don't think we should break existing topics. Just disallow new
> >> > topics going forward.
> >> >
> >> > Agree that having both is horrible, but we should have a solution that
> >> > fails when you run "kafka_topics.sh --create", not when you configure
> >> > Ganglia.
> >> >
> >> > Gwen
> >> >
> >> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> >> > > Unfortunately '.' is pretty common too. I agree that it is perverse,
> >> but
> >> > > people seem to do it. Breaking all the topics with '.' in the name
> >> seems
> >> > > like it could be worse than combining metrics for people who have a
> >> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY
> perverse,
> >> > > no?).
> >> > >
> >> > > Where is our Dean of Compatibility, Ewen, on this?
> >> > >
> >> > > -Jay
> >> > >
> >> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
> >> wrote:
> >> > >
> >> > >> My selfish point of view is that we do #1, as we use "_"
> extensively
> >> in
> >> > >> topic names here :) I also happen to think it's the right choice,
> >> > >> specifically because "." has more special meanings, as you noted.
> >> > >>
> >> > >> -Todd
> >> > >>
> >> > >>
> >> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <
> gshapira@cloudera.com>
> >> > >> wrote:
> >> > >>
> >> > >> > Unintentional side effect from allowing IP addresses in consumer
> >> > client
> >> > >> > IDs :)
> >> > >> >
> >> > >> > So the question is, what do we do now?
> >> > >> >
> >> > >> > 1) disallow "."
> >> > >> > 2) disallow "_"
> >> > >> > 3) find a reversible way to encode "." and "_" that won't break
> >> > existing
> >> > >> > metrics
> >> > >> > 4) all of the above?
> >> > >> >
> >> > >> > btw. it looks like "." and ".." are currently valid. Topic names
> are
> >> > >> > used for directories, right? this sounds like fun :)
> >> > >> >
> >> > >> > I vote for option #1, although if someone has a good idea for #3
> it
> >> > >> > will be even better.
> >> > >> >
> >> > >> > Gwen
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <
> ghenke@cloudera.com>
> >> > >> wrote:
> >> > >> > > Found it was added here:
> >> > >> https://issues.apache.org/jira/browse/KAFKA-697
> >> > >> > >
> >> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <
> tpalino@gmail.com>
> >> > >> wrote:
> >> > >> > >
> >> > >> > >> This was definitely changed at some point after KAFKA-495. The
> >> > >> question
> >> > >> > is
> >> > >> > >> when and why.
> >> > >> > >>
> >> > >> > >> Here's the relevant code from that patch:
> >> > >> > >>
> >> > >> > >>
> >> ===================================================================
> >> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> >> 1390178)
> >> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> >> > >> > >> @@ -21,24 +21,21 @@
> >> > >> > >>  import util.matching.Regex
> >> > >> > >>
> >> > >> > >>  object Topic {
> >> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> >> > >> > >>
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> -Todd
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> >> ghenke@cloudera.com>
> >> > >> > wrote:
> >> > >> > >>
> >> > >> > >> > kafka.common.Topic shows that currently period is a valid
> >> > character
> >> > >> > and I
> >> > >> > >> > have verified I can use kafka-topics.sh to create a new
> topic
> >> > with a
> >> > >> > >> > period.
> >> > >> > >> >
> >> > >> > >> >
> >> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> >> > currently
> >> > >> > uses
> >> > >> > >> > Topic.validate before writing to Zookeeper.
> >> > >> > >> >
> >> > >> > >> > Should period character support be removed? I was under the
> >> same
> >> > >> > >> impression
> >> > >> > >> > as Gwen, that a period was used by many as a way to "group"
> >> > topics.
> >> > >> > >> >
> >> > >> > >> > The code is pasted below since its small:
> >> > >> > >> >
> >> > >> > >> > object Topic {
> >> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> > >> > >> >   private val maxNameLength = 255
> >> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> >> > >> > >> >
> >> > >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> >> > >> > >> >
> >> > >> > >> >   def validate(topic: String) {
> >> > >> > >> >     if (topic.length <= 0)
> >> > >> > >> >       throw new InvalidTopicException("topic name is
> illegal,
> >> > can't
> >> > >> be
> >> > >> > >> > empty")
> >> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> >> > >> > >> >       throw new InvalidTopicException("topic name cannot be
> >> > \".\" or
> >> > >> > >> > \"..\"")
> >> > >> > >> >     else if (topic.length > maxNameLength)
> >> > >> > >> >       throw new InvalidTopicException("topic name is
> illegal,
> >> > can't
> >> > >> be
> >> > >> > >> > longer than " + maxNameLength + " characters")
> >> > >> > >> >
> >> > >> > >> >     rgx.findFirstIn(topic) match {
> >> > >> > >> >       case Some(t) =>
> >> > >> > >> >         if (!t.equals(topic))
> >> > >> > >> >           throw new InvalidTopicException("topic name " +
> topic
> >> > + "
> >> > >> is
> >> > >> > >> > illegal, contains a character other than ASCII
> alphanumerics,
> >> > '.',
> >> > >> '_'
> >> > >> > >> and
> >> > >> > >> > '-'")
> >> > >> > >> >       case None => throw new InvalidTopicException("topic
> name
> >> "
> >> > +
> >> > >> > topic
> >> > >> > >> +
> >> > >> > >> > " is illegal,  contains a character other than ASCII
> >> > alphanumerics,
> >> > >> > '.',
> >> > >> > >> > '_' and '-'")
> >> > >> > >> >     }
> >> > >> > >> >   }
> >> > >> > >> > }
> >> > >> > >> >
> >> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> >> tpalino@gmail.com>
> >> > >> > wrote:
> >> > >> > >> >
> >> > >> > >> > > I had to go look this one up again to make sure -
> >> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> > >> > >> > >
> >> > >> > >> > > The only valid character names for topics are
> alphanumeric,
> >> > >> > underscore,
> >> > >> > >> > and
> >> > >> > >> > > dash. A period is not supposed to be a valid character to
> >> use.
> >> > If
> >> > >> > >> you're
> >> > >> > >> > > seeing them, then one of two things have happened:
> >> > >> > >> > >
> >> > >> > >> > > 1) You have topic names that are grandfathered in from
> before
> >> > that
> >> > >> > >> patch
> >> > >> > >> > > 2) The patch is not working properly and there is
> somewhere
> >> in
> >> > the
> >> > >> > >> broker
> >> > >> > >> > > that the standard is not being enforced.
> >> > >> > >> > >
> >> > >> > >> > > -Todd
> >> > >> > >> > >
> >> > >> > >> > >
> >> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> >> > brock@apache.org>
> >> > >> > >> wrote:
> >> > >> > >> > >
> >> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> > >> > >> gshapira@cloudera.com>
> >> > >> > >> > > > wrote:
> >> > >> > >> > > > > Hi Kafka Fans,
> >> > >> > >> > > > >
> >> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the
> other
> >> > named
> >> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> >> > >> kafka_lab_2
> >> > >> > >> for
> >> > >> > >> > > > > both, effectively making it impossible to monitor them
> >> > >> properly.
> >> > >> > >> > > > >
> >> > >> > >> > > > > The reason this happens is that using "." in topic
> names
> >> is
> >> > >> > pretty
> >> > >> > >> > > > > common, especially as a way to group topics into data
> >> > centers,
> >> > >> > >> > > > > relevant apps, etc - basically a work-around to our
> >> current
> >> > >> > lack of
> >> > >> > >> > > > > name spaces. However, most metric monitoring systems
> >> using
> >> > "."
> >> > >> > to
> >> > >> > >> > > > > annotate hierarchy, so to avoid issues around metric
> >> names,
> >> > >> > Kafka
> >> > >> > >> > > > > replaces the "." in the name with an underscore.
> >> > >> > >> > > > >
> >> > >> > >> > > > > This generates good metric names, but creates the
> problem
> >> > with
> >> > >> > name
> >> > >> > >> > > > collisions.
> >> > >> > >> > > > >
> >> > >> > >> > > > > I'm wondering if it makes sense to simply limit the
> range
> >> > of
> >> > >> > >> > > > > characters permitted in a topic name and disallow "_"?
> >> > >> Obviously
> >> > >> > >> > > > > existing topics will need to remain as is, which is a
> bit
> >> > >> > awkward.
> >> > >> > >> > > >
> >> > >> > >> > > > Interesting problem! Many if not most users I
> personally am
> >> > >> aware
> >> > >> > of
> >> > >> > >> > > > use "_" as a separator in topic names. I am sure that
> many
> >> > users
> >> > >> > >> would
> >> > >> > >> > > > be quite surprised by this limitation. With that said,
> I am
> >> > sure
> >> > >> > >> > > > they'd transition accordingly.
> >> > >> > >> > > >
> >> > >> > >> > > > >
> >> > >> > >> > > > > If anyone has better backward-compatible solutions to
> >> this,
> >> > >> I'm
> >> > >> > all
> >> > >> > >> > > ears
> >> > >> > >> > > > :)
> >> > >> > >> > > > >
> >> > >> > >> > > > > Gwen
> >> > >> > >> > > >
> >> > >> > >> > >
> >> > >> > >> >
> >> > >> > >> >
> >> > >> > >> >
> >> > >> > >> > --
> >> > >> > >> > Grant Henke
> >> > >> > >> > Solutions Consultant | Cloudera
> >> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> >> > >> > linkedin.com/in/granthenke
> >> > >> > >> >
> >> > >> > >>
> >> > >> > >
> >> > >> > >
> >> > >> > >
> >> > >> > > --
> >> > >> > > Grant Henke
> >> > >> > > Solutions Consultant | Cloudera
> >> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> >> > linkedin.com/in/granthenke
> >> > >> >
> >> > >>
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Neha
> >>
>



-- 
Thanks,
Ewen

Re: [Discussion] Limitations on topic names

Posted by Gwen Shapira <gs...@cloudera.com>.
I find dots more common in my customer base, so I will definitely feel
the pain of removing them.

However, "." are already used in metrics, file names, directories, etc
- so if we keep the dots, we need to keep code that translates them
and document the translation. Just banning "." seems more natural.
Also, as Grant mentioned, we'll probably have our own special usage
for "." down the line.

On Fri, Jul 10, 2015 at 2:12 PM, Todd Palino <tp...@gmail.com> wrote:
> I absolutely disagree with #2, Neha. That will break a lot of
> infrastructure within LinkedIn. That said, removing "." might break other
> people as well, but I think we should have a clearer idea of how much usage
> there is on either side.
>
> -Todd
>
>
> On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io> wrote:
>
>> "." seems natural for grouping topic names. +1 for 2) going forward only
>> without breaking previously created topics with "_" though that might
>> require us to patch the code somewhat awkwardly till we phase it out a
>> couple (purposely left vague to stay out of Ewen's wrath :-)) versions
>> later.
>>
>> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com>
>> wrote:
>>
>> > I don't think we should break existing topics. Just disallow new
>> > topics going forward.
>> >
>> > Agree that having both is horrible, but we should have a solution that
>> > fails when you run "kafka_topics.sh --create", not when you configure
>> > Ganglia.
>> >
>> > Gwen
>> >
>> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
>> > > Unfortunately '.' is pretty common too. I agree that it is perverse,
>> but
>> > > people seem to do it. Breaking all the topics with '.' in the name
>> seems
>> > > like it could be worse than combining metrics for people who have a
>> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
>> > > no?).
>> > >
>> > > Where is our Dean of Compatibility, Ewen, on this?
>> > >
>> > > -Jay
>> > >
>> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
>> wrote:
>> > >
>> > >> My selfish point of view is that we do #1, as we use "_" extensively
>> in
>> > >> topic names here :) I also happen to think it's the right choice,
>> > >> specifically because "." has more special meanings, as you noted.
>> > >>
>> > >> -Todd
>> > >>
>> > >>
>> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
>> > >> wrote:
>> > >>
>> > >> > Unintentional side effect from allowing IP addresses in consumer
>> > client
>> > >> > IDs :)
>> > >> >
>> > >> > So the question is, what do we do now?
>> > >> >
>> > >> > 1) disallow "."
>> > >> > 2) disallow "_"
>> > >> > 3) find a reversible way to encode "." and "_" that won't break
>> > existing
>> > >> > metrics
>> > >> > 4) all of the above?
>> > >> >
>> > >> > btw. it looks like "." and ".." are currently valid. Topic names are
>> > >> > used for directories, right? this sounds like fun :)
>> > >> >
>> > >> > I vote for option #1, although if someone has a good idea for #3 it
>> > >> > will be even better.
>> > >> >
>> > >> > Gwen
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
>> > >> wrote:
>> > >> > > Found it was added here:
>> > >> https://issues.apache.org/jira/browse/KAFKA-697
>> > >> > >
>> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
>> > >> wrote:
>> > >> > >
>> > >> > >> This was definitely changed at some point after KAFKA-495. The
>> > >> question
>> > >> > is
>> > >> > >> when and why.
>> > >> > >>
>> > >> > >> Here's the relevant code from that patch:
>> > >> > >>
>> > >> > >>
>> ===================================================================
>> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
>> 1390178)
>> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
>> > >> > >> @@ -21,24 +21,21 @@
>> > >> > >>  import util.matching.Regex
>> > >> > >>
>> > >> > >>  object Topic {
>> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
>> > >> > >>
>> > >> > >>
>> > >> > >>
>> > >> > >> -Todd
>> > >> > >>
>> > >> > >>
>> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
>> ghenke@cloudera.com>
>> > >> > wrote:
>> > >> > >>
>> > >> > >> > kafka.common.Topic shows that currently period is a valid
>> > character
>> > >> > and I
>> > >> > >> > have verified I can use kafka-topics.sh to create a new topic
>> > with a
>> > >> > >> > period.
>> > >> > >> >
>> > >> > >> >
>> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
>> > currently
>> > >> > uses
>> > >> > >> > Topic.validate before writing to Zookeeper.
>> > >> > >> >
>> > >> > >> > Should period character support be removed? I was under the
>> same
>> > >> > >> impression
>> > >> > >> > as Gwen, that a period was used by many as a way to "group"
>> > topics.
>> > >> > >> >
>> > >> > >> > The code is pasted below since its small:
>> > >> > >> >
>> > >> > >> > object Topic {
>> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> > >> > >> >   private val maxNameLength = 255
>> > >> > >> >   private val rgx = new Regex(legalChars + "+")
>> > >> > >> >
>> > >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
>> > >> > >> >
>> > >> > >> >   def validate(topic: String) {
>> > >> > >> >     if (topic.length <= 0)
>> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
>> > can't
>> > >> be
>> > >> > >> > empty")
>> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
>> > >> > >> >       throw new InvalidTopicException("topic name cannot be
>> > \".\" or
>> > >> > >> > \"..\"")
>> > >> > >> >     else if (topic.length > maxNameLength)
>> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
>> > can't
>> > >> be
>> > >> > >> > longer than " + maxNameLength + " characters")
>> > >> > >> >
>> > >> > >> >     rgx.findFirstIn(topic) match {
>> > >> > >> >       case Some(t) =>
>> > >> > >> >         if (!t.equals(topic))
>> > >> > >> >           throw new InvalidTopicException("topic name " + topic
>> > + "
>> > >> is
>> > >> > >> > illegal, contains a character other than ASCII alphanumerics,
>> > '.',
>> > >> '_'
>> > >> > >> and
>> > >> > >> > '-'")
>> > >> > >> >       case None => throw new InvalidTopicException("topic name
>> "
>> > +
>> > >> > topic
>> > >> > >> +
>> > >> > >> > " is illegal,  contains a character other than ASCII
>> > alphanumerics,
>> > >> > '.',
>> > >> > >> > '_' and '-'")
>> > >> > >> >     }
>> > >> > >> >   }
>> > >> > >> > }
>> > >> > >> >
>> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
>> tpalino@gmail.com>
>> > >> > wrote:
>> > >> > >> >
>> > >> > >> > > I had to go look this one up again to make sure -
>> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
>> > >> > >> > >
>> > >> > >> > > The only valid character names for topics are alphanumeric,
>> > >> > underscore,
>> > >> > >> > and
>> > >> > >> > > dash. A period is not supposed to be a valid character to
>> use.
>> > If
>> > >> > >> you're
>> > >> > >> > > seeing them, then one of two things have happened:
>> > >> > >> > >
>> > >> > >> > > 1) You have topic names that are grandfathered in from before
>> > that
>> > >> > >> patch
>> > >> > >> > > 2) The patch is not working properly and there is somewhere
>> in
>> > the
>> > >> > >> broker
>> > >> > >> > > that the standard is not being enforced.
>> > >> > >> > >
>> > >> > >> > > -Todd
>> > >> > >> > >
>> > >> > >> > >
>> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
>> > brock@apache.org>
>> > >> > >> wrote:
>> > >> > >> > >
>> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> > >> > >> gshapira@cloudera.com>
>> > >> > >> > > > wrote:
>> > >> > >> > > > > Hi Kafka Fans,
>> > >> > >> > > > >
>> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the other
>> > named
>> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
>> > >> kafka_lab_2
>> > >> > >> for
>> > >> > >> > > > > both, effectively making it impossible to monitor them
>> > >> properly.
>> > >> > >> > > > >
>> > >> > >> > > > > The reason this happens is that using "." in topic names
>> is
>> > >> > pretty
>> > >> > >> > > > > common, especially as a way to group topics into data
>> > centers,
>> > >> > >> > > > > relevant apps, etc - basically a work-around to our
>> current
>> > >> > lack of
>> > >> > >> > > > > name spaces. However, most metric monitoring systems
>> using
>> > "."
>> > >> > to
>> > >> > >> > > > > annotate hierarchy, so to avoid issues around metric
>> names,
>> > >> > Kafka
>> > >> > >> > > > > replaces the "." in the name with an underscore.
>> > >> > >> > > > >
>> > >> > >> > > > > This generates good metric names, but creates the problem
>> > with
>> > >> > name
>> > >> > >> > > > collisions.
>> > >> > >> > > > >
>> > >> > >> > > > > I'm wondering if it makes sense to simply limit the range
>> > of
>> > >> > >> > > > > characters permitted in a topic name and disallow "_"?
>> > >> Obviously
>> > >> > >> > > > > existing topics will need to remain as is, which is a bit
>> > >> > awkward.
>> > >> > >> > > >
>> > >> > >> > > > Interesting problem! Many if not most users I personally am
>> > >> aware
>> > >> > of
>> > >> > >> > > > use "_" as a separator in topic names. I am sure that many
>> > users
>> > >> > >> would
>> > >> > >> > > > be quite surprised by this limitation. With that said, I am
>> > sure
>> > >> > >> > > > they'd transition accordingly.
>> > >> > >> > > >
>> > >> > >> > > > >
>> > >> > >> > > > > If anyone has better backward-compatible solutions to
>> this,
>> > >> I'm
>> > >> > all
>> > >> > >> > > ears
>> > >> > >> > > > :)
>> > >> > >> > > > >
>> > >> > >> > > > > Gwen
>> > >> > >> > > >
>> > >> > >> > >
>> > >> > >> >
>> > >> > >> >
>> > >> > >> >
>> > >> > >> > --
>> > >> > >> > Grant Henke
>> > >> > >> > Solutions Consultant | Cloudera
>> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
>> > >> > linkedin.com/in/granthenke
>> > >> > >> >
>> > >> > >>
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > --
>> > >> > > Grant Henke
>> > >> > > Solutions Consultant | Cloudera
>> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
>> > linkedin.com/in/granthenke
>> > >> >
>> > >>
>> >
>>
>>
>>
>> --
>> Thanks,
>> Neha
>>

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
I absolutely disagree with #2, Neha. That will break a lot of
infrastructure within LinkedIn. That said, removing "." might break other
people as well, but I think we should have a clearer idea of how much usage
there is on either side.

-Todd


On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io> wrote:

> "." seems natural for grouping topic names. +1 for 2) going forward only
> without breaking previously created topics with "_" though that might
> require us to patch the code somewhat awkwardly till we phase it out a
> couple (purposely left vague to stay out of Ewen's wrath :-)) versions
> later.
>
> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > I don't think we should break existing topics. Just disallow new
> > topics going forward.
> >
> > Agree that having both is horrible, but we should have a solution that
> > fails when you run "kafka_topics.sh --create", not when you configure
> > Ganglia.
> >
> > Gwen
> >
> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> > > Unfortunately '.' is pretty common too. I agree that it is perverse,
> but
> > > people seem to do it. Breaking all the topics with '.' in the name
> seems
> > > like it could be worse than combining metrics for people who have a
> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
> > > no?).
> > >
> > > Where is our Dean of Compatibility, Ewen, on this?
> > >
> > > -Jay
> > >
> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >
> > >> My selfish point of view is that we do #1, as we use "_" extensively
> in
> > >> topic names here :) I also happen to think it's the right choice,
> > >> specifically because "." has more special meanings, as you noted.
> > >>
> > >> -Todd
> > >>
> > >>
> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> > >> wrote:
> > >>
> > >> > Unintentional side effect from allowing IP addresses in consumer
> > client
> > >> > IDs :)
> > >> >
> > >> > So the question is, what do we do now?
> > >> >
> > >> > 1) disallow "."
> > >> > 2) disallow "_"
> > >> > 3) find a reversible way to encode "." and "_" that won't break
> > existing
> > >> > metrics
> > >> > 4) all of the above?
> > >> >
> > >> > btw. it looks like "." and ".." are currently valid. Topic names are
> > >> > used for directories, right? this sounds like fun :)
> > >> >
> > >> > I vote for option #1, although if someone has a good idea for #3 it
> > >> > will be even better.
> > >> >
> > >> > Gwen
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> > >> wrote:
> > >> > > Found it was added here:
> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > >> > >
> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > >> This was definitely changed at some point after KAFKA-495. The
> > >> question
> > >> > is
> > >> > >> when and why.
> > >> > >>
> > >> > >> Here's the relevant code from that patch:
> > >> > >>
> > >> > >>
> ===================================================================
> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> 1390178)
> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> > >> > >> @@ -21,24 +21,21 @@
> > >> > >>  import util.matching.Regex
> > >> > >>
> > >> > >>  object Topic {
> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> -Todd
> > >> > >>
> > >> > >>
> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> ghenke@cloudera.com>
> > >> > wrote:
> > >> > >>
> > >> > >> > kafka.common.Topic shows that currently period is a valid
> > character
> > >> > and I
> > >> > >> > have verified I can use kafka-topics.sh to create a new topic
> > with a
> > >> > >> > period.
> > >> > >> >
> > >> > >> >
> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > currently
> > >> > uses
> > >> > >> > Topic.validate before writing to Zookeeper.
> > >> > >> >
> > >> > >> > Should period character support be removed? I was under the
> same
> > >> > >> impression
> > >> > >> > as Gwen, that a period was used by many as a way to "group"
> > topics.
> > >> > >> >
> > >> > >> > The code is pasted below since its small:
> > >> > >> >
> > >> > >> > object Topic {
> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> > >> >   private val maxNameLength = 255
> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> > >> >
> > >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> > >> > >> >
> > >> > >> >   def validate(topic: String) {
> > >> > >> >     if (topic.length <= 0)
> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
> > can't
> > >> be
> > >> > >> > empty")
> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> > >> > >> >       throw new InvalidTopicException("topic name cannot be
> > \".\" or
> > >> > >> > \"..\"")
> > >> > >> >     else if (topic.length > maxNameLength)
> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
> > can't
> > >> be
> > >> > >> > longer than " + maxNameLength + " characters")
> > >> > >> >
> > >> > >> >     rgx.findFirstIn(topic) match {
> > >> > >> >       case Some(t) =>
> > >> > >> >         if (!t.equals(topic))
> > >> > >> >           throw new InvalidTopicException("topic name " + topic
> > + "
> > >> is
> > >> > >> > illegal, contains a character other than ASCII alphanumerics,
> > '.',
> > >> '_'
> > >> > >> and
> > >> > >> > '-'")
> > >> > >> >       case None => throw new InvalidTopicException("topic name
> "
> > +
> > >> > topic
> > >> > >> +
> > >> > >> > " is illegal,  contains a character other than ASCII
> > alphanumerics,
> > >> > '.',
> > >> > >> > '_' and '-'")
> > >> > >> >     }
> > >> > >> >   }
> > >> > >> > }
> > >> > >> >
> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> tpalino@gmail.com>
> > >> > wrote:
> > >> > >> >
> > >> > >> > > I had to go look this one up again to make sure -
> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> > >> > >
> > >> > >> > > The only valid character names for topics are alphanumeric,
> > >> > underscore,
> > >> > >> > and
> > >> > >> > > dash. A period is not supposed to be a valid character to
> use.
> > If
> > >> > >> you're
> > >> > >> > > seeing them, then one of two things have happened:
> > >> > >> > >
> > >> > >> > > 1) You have topic names that are grandfathered in from before
> > that
> > >> > >> patch
> > >> > >> > > 2) The patch is not working properly and there is somewhere
> in
> > the
> > >> > >> broker
> > >> > >> > > that the standard is not being enforced.
> > >> > >> > >
> > >> > >> > > -Todd
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > brock@apache.org>
> > >> > >> wrote:
> > >> > >> > >
> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> > >> gshapira@cloudera.com>
> > >> > >> > > > wrote:
> > >> > >> > > > > Hi Kafka Fans,
> > >> > >> > > > >
> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the other
> > named
> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> > >> kafka_lab_2
> > >> > >> for
> > >> > >> > > > > both, effectively making it impossible to monitor them
> > >> properly.
> > >> > >> > > > >
> > >> > >> > > > > The reason this happens is that using "." in topic names
> is
> > >> > pretty
> > >> > >> > > > > common, especially as a way to group topics into data
> > centers,
> > >> > >> > > > > relevant apps, etc - basically a work-around to our
> current
> > >> > lack of
> > >> > >> > > > > name spaces. However, most metric monitoring systems
> using
> > "."
> > >> > to
> > >> > >> > > > > annotate hierarchy, so to avoid issues around metric
> names,
> > >> > Kafka
> > >> > >> > > > > replaces the "." in the name with an underscore.
> > >> > >> > > > >
> > >> > >> > > > > This generates good metric names, but creates the problem
> > with
> > >> > name
> > >> > >> > > > collisions.
> > >> > >> > > > >
> > >> > >> > > > > I'm wondering if it makes sense to simply limit the range
> > of
> > >> > >> > > > > characters permitted in a topic name and disallow "_"?
> > >> Obviously
> > >> > >> > > > > existing topics will need to remain as is, which is a bit
> > >> > awkward.
> > >> > >> > > >
> > >> > >> > > > Interesting problem! Many if not most users I personally am
> > >> aware
> > >> > of
> > >> > >> > > > use "_" as a separator in topic names. I am sure that many
> > users
> > >> > >> would
> > >> > >> > > > be quite surprised by this limitation. With that said, I am
> > sure
> > >> > >> > > > they'd transition accordingly.
> > >> > >> > > >
> > >> > >> > > > >
> > >> > >> > > > > If anyone has better backward-compatible solutions to
> this,
> > >> I'm
> > >> > all
> > >> > >> > > ears
> > >> > >> > > > :)
> > >> > >> > > > >
> > >> > >> > > > > Gwen
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >> >
> > >> > >> >
> > >> > >> > --
> > >> > >> > Grant Henke
> > >> > >> > Solutions Consultant | Cloudera
> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > >> > linkedin.com/in/granthenke
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Grant Henke
> > >> > > Solutions Consultant | Cloudera
> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >
> > >>
> >
>
>
>
> --
> Thanks,
> Neha
>

Re: [Discussion] Limitations on topic names

Posted by Ashish Singh <as...@cloudera.com>.
The problem with '.' seems only to be in case of metrics. Should kafka
replace '.' with some special character, not in [a-zA-Z0-9\\._\\-] or some
reserved seq of characters?

On Fri, Jul 10, 2015 at 2:08 PM, Neha Narkhede <ne...@confluent.io> wrote:

> "." seems natural for grouping topic names. +1 for 2) going forward only
> without breaking previously created topics with "_" though that might
> require us to patch the code somewhat awkwardly till we phase it out a
> couple (purposely left vague to stay out of Ewen's wrath :-)) versions
> later.
>
> On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > I don't think we should break existing topics. Just disallow new
> > topics going forward.
> >
> > Agree that having both is horrible, but we should have a solution that
> > fails when you run "kafka_topics.sh --create", not when you configure
> > Ganglia.
> >
> > Gwen
> >
> > On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> > > Unfortunately '.' is pretty common too. I agree that it is perverse,
> but
> > > people seem to do it. Breaking all the topics with '.' in the name
> seems
> > > like it could be worse than combining metrics for people who have a
> > > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
> > > no?).
> > >
> > > Where is our Dean of Compatibility, Ewen, on this?
> > >
> > > -Jay
> > >
> > > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >
> > >> My selfish point of view is that we do #1, as we use "_" extensively
> in
> > >> topic names here :) I also happen to think it's the right choice,
> > >> specifically because "." has more special meanings, as you noted.
> > >>
> > >> -Todd
> > >>
> > >>
> > >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> > >> wrote:
> > >>
> > >> > Unintentional side effect from allowing IP addresses in consumer
> > client
> > >> > IDs :)
> > >> >
> > >> > So the question is, what do we do now?
> > >> >
> > >> > 1) disallow "."
> > >> > 2) disallow "_"
> > >> > 3) find a reversible way to encode "." and "_" that won't break
> > existing
> > >> > metrics
> > >> > 4) all of the above?
> > >> >
> > >> > btw. it looks like "." and ".." are currently valid. Topic names are
> > >> > used for directories, right? this sounds like fun :)
> > >> >
> > >> > I vote for option #1, although if someone has a good idea for #3 it
> > >> > will be even better.
> > >> >
> > >> > Gwen
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> > >> wrote:
> > >> > > Found it was added here:
> > >> https://issues.apache.org/jira/browse/KAFKA-697
> > >> > >
> > >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > >> This was definitely changed at some point after KAFKA-495. The
> > >> question
> > >> > is
> > >> > >> when and why.
> > >> > >>
> > >> > >> Here's the relevant code from that patch:
> > >> > >>
> > >> > >>
> ===================================================================
> > >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision
> 1390178)
> > >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> > >> > >> @@ -21,24 +21,21 @@
> > >> > >>  import util.matching.Regex
> > >> > >>
> > >> > >>  object Topic {
> > >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> -Todd
> > >> > >>
> > >> > >>
> > >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <
> ghenke@cloudera.com>
> > >> > wrote:
> > >> > >>
> > >> > >> > kafka.common.Topic shows that currently period is a valid
> > character
> > >> > and I
> > >> > >> > have verified I can use kafka-topics.sh to create a new topic
> > with a
> > >> > >> > period.
> > >> > >> >
> > >> > >> >
> > >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> > currently
> > >> > uses
> > >> > >> > Topic.validate before writing to Zookeeper.
> > >> > >> >
> > >> > >> > Should period character support be removed? I was under the
> same
> > >> > >> impression
> > >> > >> > as Gwen, that a period was used by many as a way to "group"
> > topics.
> > >> > >> >
> > >> > >> > The code is pasted below since its small:
> > >> > >> >
> > >> > >> > object Topic {
> > >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> > >> >   private val maxNameLength = 255
> > >> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> > >> >
> > >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> > >> > >> >
> > >> > >> >   def validate(topic: String) {
> > >> > >> >     if (topic.length <= 0)
> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
> > can't
> > >> be
> > >> > >> > empty")
> > >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> > >> > >> >       throw new InvalidTopicException("topic name cannot be
> > \".\" or
> > >> > >> > \"..\"")
> > >> > >> >     else if (topic.length > maxNameLength)
> > >> > >> >       throw new InvalidTopicException("topic name is illegal,
> > can't
> > >> be
> > >> > >> > longer than " + maxNameLength + " characters")
> > >> > >> >
> > >> > >> >     rgx.findFirstIn(topic) match {
> > >> > >> >       case Some(t) =>
> > >> > >> >         if (!t.equals(topic))
> > >> > >> >           throw new InvalidTopicException("topic name " + topic
> > + "
> > >> is
> > >> > >> > illegal, contains a character other than ASCII alphanumerics,
> > '.',
> > >> '_'
> > >> > >> and
> > >> > >> > '-'")
> > >> > >> >       case None => throw new InvalidTopicException("topic name
> "
> > +
> > >> > topic
> > >> > >> +
> > >> > >> > " is illegal,  contains a character other than ASCII
> > alphanumerics,
> > >> > '.',
> > >> > >> > '_' and '-'")
> > >> > >> >     }
> > >> > >> >   }
> > >> > >> > }
> > >> > >> >
> > >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <
> tpalino@gmail.com>
> > >> > wrote:
> > >> > >> >
> > >> > >> > > I had to go look this one up again to make sure -
> > >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> > >> > >
> > >> > >> > > The only valid character names for topics are alphanumeric,
> > >> > underscore,
> > >> > >> > and
> > >> > >> > > dash. A period is not supposed to be a valid character to
> use.
> > If
> > >> > >> you're
> > >> > >> > > seeing them, then one of two things have happened:
> > >> > >> > >
> > >> > >> > > 1) You have topic names that are grandfathered in from before
> > that
> > >> > >> patch
> > >> > >> > > 2) The patch is not working properly and there is somewhere
> in
> > the
> > >> > >> broker
> > >> > >> > > that the standard is not being enforced.
> > >> > >> > >
> > >> > >> > > -Todd
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> > brock@apache.org>
> > >> > >> wrote:
> > >> > >> > >
> > >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> > >> gshapira@cloudera.com>
> > >> > >> > > > wrote:
> > >> > >> > > > > Hi Kafka Fans,
> > >> > >> > > > >
> > >> > >> > > > > If you have one topic named "kafka_lab_2" and the other
> > named
> > >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> > >> kafka_lab_2
> > >> > >> for
> > >> > >> > > > > both, effectively making it impossible to monitor them
> > >> properly.
> > >> > >> > > > >
> > >> > >> > > > > The reason this happens is that using "." in topic names
> is
> > >> > pretty
> > >> > >> > > > > common, especially as a way to group topics into data
> > centers,
> > >> > >> > > > > relevant apps, etc - basically a work-around to our
> current
> > >> > lack of
> > >> > >> > > > > name spaces. However, most metric monitoring systems
> using
> > "."
> > >> > to
> > >> > >> > > > > annotate hierarchy, so to avoid issues around metric
> names,
> > >> > Kafka
> > >> > >> > > > > replaces the "." in the name with an underscore.
> > >> > >> > > > >
> > >> > >> > > > > This generates good metric names, but creates the problem
> > with
> > >> > name
> > >> > >> > > > collisions.
> > >> > >> > > > >
> > >> > >> > > > > I'm wondering if it makes sense to simply limit the range
> > of
> > >> > >> > > > > characters permitted in a topic name and disallow "_"?
> > >> Obviously
> > >> > >> > > > > existing topics will need to remain as is, which is a bit
> > >> > awkward.
> > >> > >> > > >
> > >> > >> > > > Interesting problem! Many if not most users I personally am
> > >> aware
> > >> > of
> > >> > >> > > > use "_" as a separator in topic names. I am sure that many
> > users
> > >> > >> would
> > >> > >> > > > be quite surprised by this limitation. With that said, I am
> > sure
> > >> > >> > > > they'd transition accordingly.
> > >> > >> > > >
> > >> > >> > > > >
> > >> > >> > > > > If anyone has better backward-compatible solutions to
> this,
> > >> I'm
> > >> > all
> > >> > >> > > ears
> > >> > >> > > > :)
> > >> > >> > > > >
> > >> > >> > > > > Gwen
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >> >
> > >> > >> >
> > >> > >> > --
> > >> > >> > Grant Henke
> > >> > >> > Solutions Consultant | Cloudera
> > >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > >> > linkedin.com/in/granthenke
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Grant Henke
> > >> > > Solutions Consultant | Cloudera
> > >> > > ghenke@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >
> > >>
> >
>
>
>
> --
> Thanks,
> Neha
>



-- 

Regards,
Ashish

Re: [Discussion] Limitations on topic names

Posted by Neha Narkhede <ne...@confluent.io>.
"." seems natural for grouping topic names. +1 for 2) going forward only
without breaking previously created topics with "_" though that might
require us to patch the code somewhat awkwardly till we phase it out a
couple (purposely left vague to stay out of Ewen's wrath :-)) versions
later.

On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> I don't think we should break existing topics. Just disallow new
> topics going forward.
>
> Agree that having both is horrible, but we should have a solution that
> fails when you run "kafka_topics.sh --create", not when you configure
> Ganglia.
>
> Gwen
>
> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> > Unfortunately '.' is pretty common too. I agree that it is perverse, but
> > people seem to do it. Breaking all the topics with '.' in the name seems
> > like it could be worse than combining metrics for people who have a
> > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
> > no?).
> >
> > Where is our Dean of Compatibility, Ewen, on this?
> >
> > -Jay
> >
> > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com> wrote:
> >
> >> My selfish point of view is that we do #1, as we use "_" extensively in
> >> topic names here :) I also happen to think it's the right choice,
> >> specifically because "." has more special meanings, as you noted.
> >>
> >> -Todd
> >>
> >>
> >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> >> wrote:
> >>
> >> > Unintentional side effect from allowing IP addresses in consumer
> client
> >> > IDs :)
> >> >
> >> > So the question is, what do we do now?
> >> >
> >> > 1) disallow "."
> >> > 2) disallow "_"
> >> > 3) find a reversible way to encode "." and "_" that won't break
> existing
> >> > metrics
> >> > 4) all of the above?
> >> >
> >> > btw. it looks like "." and ".." are currently valid. Topic names are
> >> > used for directories, right? this sounds like fun :)
> >> >
> >> > I vote for option #1, although if someone has a good idea for #3 it
> >> > will be even better.
> >> >
> >> > Gwen
> >> >
> >> >
> >> >
> >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> >> wrote:
> >> > > Found it was added here:
> >> https://issues.apache.org/jira/browse/KAFKA-697
> >> > >
> >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> >> wrote:
> >> > >
> >> > >> This was definitely changed at some point after KAFKA-495. The
> >> question
> >> > is
> >> > >> when and why.
> >> > >>
> >> > >> Here's the relevant code from that patch:
> >> > >>
> >> > >> ===================================================================
> >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> >> > >> @@ -21,24 +21,21 @@
> >> > >>  import util.matching.Regex
> >> > >>
> >> > >>  object Topic {
> >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> >> > >>
> >> > >>
> >> > >>
> >> > >> -Todd
> >> > >>
> >> > >>
> >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> >> > wrote:
> >> > >>
> >> > >> > kafka.common.Topic shows that currently period is a valid
> character
> >> > and I
> >> > >> > have verified I can use kafka-topics.sh to create a new topic
> with a
> >> > >> > period.
> >> > >> >
> >> > >> >
> >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> currently
> >> > uses
> >> > >> > Topic.validate before writing to Zookeeper.
> >> > >> >
> >> > >> > Should period character support be removed? I was under the same
> >> > >> impression
> >> > >> > as Gwen, that a period was used by many as a way to "group"
> topics.
> >> > >> >
> >> > >> > The code is pasted below since its small:
> >> > >> >
> >> > >> > object Topic {
> >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> > >> >   private val maxNameLength = 255
> >> > >> >   private val rgx = new Regex(legalChars + "+")
> >> > >> >
> >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> >> > >> >
> >> > >> >   def validate(topic: String) {
> >> > >> >     if (topic.length <= 0)
> >> > >> >       throw new InvalidTopicException("topic name is illegal,
> can't
> >> be
> >> > >> > empty")
> >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> >> > >> >       throw new InvalidTopicException("topic name cannot be
> \".\" or
> >> > >> > \"..\"")
> >> > >> >     else if (topic.length > maxNameLength)
> >> > >> >       throw new InvalidTopicException("topic name is illegal,
> can't
> >> be
> >> > >> > longer than " + maxNameLength + " characters")
> >> > >> >
> >> > >> >     rgx.findFirstIn(topic) match {
> >> > >> >       case Some(t) =>
> >> > >> >         if (!t.equals(topic))
> >> > >> >           throw new InvalidTopicException("topic name " + topic
> + "
> >> is
> >> > >> > illegal, contains a character other than ASCII alphanumerics,
> '.',
> >> '_'
> >> > >> and
> >> > >> > '-'")
> >> > >> >       case None => throw new InvalidTopicException("topic name "
> +
> >> > topic
> >> > >> +
> >> > >> > " is illegal,  contains a character other than ASCII
> alphanumerics,
> >> > '.',
> >> > >> > '_' and '-'")
> >> > >> >     }
> >> > >> >   }
> >> > >> > }
> >> > >> >
> >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> >> > wrote:
> >> > >> >
> >> > >> > > I had to go look this one up again to make sure -
> >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> > >> > >
> >> > >> > > The only valid character names for topics are alphanumeric,
> >> > underscore,
> >> > >> > and
> >> > >> > > dash. A period is not supposed to be a valid character to use.
> If
> >> > >> you're
> >> > >> > > seeing them, then one of two things have happened:
> >> > >> > >
> >> > >> > > 1) You have topic names that are grandfathered in from before
> that
> >> > >> patch
> >> > >> > > 2) The patch is not working properly and there is somewhere in
> the
> >> > >> broker
> >> > >> > > that the standard is not being enforced.
> >> > >> > >
> >> > >> > > -Todd
> >> > >> > >
> >> > >> > >
> >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> brock@apache.org>
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> > >> gshapira@cloudera.com>
> >> > >> > > > wrote:
> >> > >> > > > > Hi Kafka Fans,
> >> > >> > > > >
> >> > >> > > > > If you have one topic named "kafka_lab_2" and the other
> named
> >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> >> kafka_lab_2
> >> > >> for
> >> > >> > > > > both, effectively making it impossible to monitor them
> >> properly.
> >> > >> > > > >
> >> > >> > > > > The reason this happens is that using "." in topic names is
> >> > pretty
> >> > >> > > > > common, especially as a way to group topics into data
> centers,
> >> > >> > > > > relevant apps, etc - basically a work-around to our current
> >> > lack of
> >> > >> > > > > name spaces. However, most metric monitoring systems using
> "."
> >> > to
> >> > >> > > > > annotate hierarchy, so to avoid issues around metric names,
> >> > Kafka
> >> > >> > > > > replaces the "." in the name with an underscore.
> >> > >> > > > >
> >> > >> > > > > This generates good metric names, but creates the problem
> with
> >> > name
> >> > >> > > > collisions.
> >> > >> > > > >
> >> > >> > > > > I'm wondering if it makes sense to simply limit the range
> of
> >> > >> > > > > characters permitted in a topic name and disallow "_"?
> >> Obviously
> >> > >> > > > > existing topics will need to remain as is, which is a bit
> >> > awkward.
> >> > >> > > >
> >> > >> > > > Interesting problem! Many if not most users I personally am
> >> aware
> >> > of
> >> > >> > > > use "_" as a separator in topic names. I am sure that many
> users
> >> > >> would
> >> > >> > > > be quite surprised by this limitation. With that said, I am
> sure
> >> > >> > > > they'd transition accordingly.
> >> > >> > > >
> >> > >> > > > >
> >> > >> > > > > If anyone has better backward-compatible solutions to this,
> >> I'm
> >> > all
> >> > >> > > ears
> >> > >> > > > :)
> >> > >> > > > >
> >> > >> > > > > Gwen
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > --
> >> > >> > Grant Henke
> >> > >> > Solutions Consultant | Cloudera
> >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> >> > linkedin.com/in/granthenke
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Grant Henke
> >> > > Solutions Consultant | Cloudera
> >> > > ghenke@cloudera.com | twitter.com/gchenke |
> linkedin.com/in/granthenke
> >> >
> >>
>



-- 
Thanks,
Neha

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
Yes, agree here. While it can be a little confusing, I think it's better to
just disallow the character for all creation steps so you can't create more
"bad" topic names, but not try and enforce it for topics that already
exist. Anyone who is in that situation is already there with regards to
metrics, and so they are probably making sure they don't collide names that
only differ in the use of "_" and ".". However, we don't want a new user to
accidentally do it.

-Todd


On Fri, Jul 10, 2015 at 2:02 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> I don't think we should break existing topics. Just disallow new
> topics going forward.
>
> Agree that having both is horrible, but we should have a solution that
> fails when you run "kafka_topics.sh --create", not when you configure
> Ganglia.
>
> Gwen
>
> On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> > Unfortunately '.' is pretty common too. I agree that it is perverse, but
> > people seem to do it. Breaking all the topics with '.' in the name seems
> > like it could be worse than combining metrics for people who have a
> > 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
> > no?).
> >
> > Where is our Dean of Compatibility, Ewen, on this?
> >
> > -Jay
> >
> > On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com> wrote:
> >
> >> My selfish point of view is that we do #1, as we use "_" extensively in
> >> topic names here :) I also happen to think it's the right choice,
> >> specifically because "." has more special meanings, as you noted.
> >>
> >> -Todd
> >>
> >>
> >> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> >> wrote:
> >>
> >> > Unintentional side effect from allowing IP addresses in consumer
> client
> >> > IDs :)
> >> >
> >> > So the question is, what do we do now?
> >> >
> >> > 1) disallow "."
> >> > 2) disallow "_"
> >> > 3) find a reversible way to encode "." and "_" that won't break
> existing
> >> > metrics
> >> > 4) all of the above?
> >> >
> >> > btw. it looks like "." and ".." are currently valid. Topic names are
> >> > used for directories, right? this sounds like fun :)
> >> >
> >> > I vote for option #1, although if someone has a good idea for #3 it
> >> > will be even better.
> >> >
> >> > Gwen
> >> >
> >> >
> >> >
> >> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> >> wrote:
> >> > > Found it was added here:
> >> https://issues.apache.org/jira/browse/KAFKA-697
> >> > >
> >> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> >> wrote:
> >> > >
> >> > >> This was definitely changed at some point after KAFKA-495. The
> >> question
> >> > is
> >> > >> when and why.
> >> > >>
> >> > >> Here's the relevant code from that patch:
> >> > >>
> >> > >> ===================================================================
> >> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> >> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> >> > >> @@ -21,24 +21,21 @@
> >> > >>  import util.matching.Regex
> >> > >>
> >> > >>  object Topic {
> >> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> >> > >>
> >> > >>
> >> > >>
> >> > >> -Todd
> >> > >>
> >> > >>
> >> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> >> > wrote:
> >> > >>
> >> > >> > kafka.common.Topic shows that currently period is a valid
> character
> >> > and I
> >> > >> > have verified I can use kafka-topics.sh to create a new topic
> with a
> >> > >> > period.
> >> > >> >
> >> > >> >
> >> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK
> currently
> >> > uses
> >> > >> > Topic.validate before writing to Zookeeper.
> >> > >> >
> >> > >> > Should period character support be removed? I was under the same
> >> > >> impression
> >> > >> > as Gwen, that a period was used by many as a way to "group"
> topics.
> >> > >> >
> >> > >> > The code is pasted below since its small:
> >> > >> >
> >> > >> > object Topic {
> >> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> > >> >   private val maxNameLength = 255
> >> > >> >   private val rgx = new Regex(legalChars + "+")
> >> > >> >
> >> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> >> > >> >
> >> > >> >   def validate(topic: String) {
> >> > >> >     if (topic.length <= 0)
> >> > >> >       throw new InvalidTopicException("topic name is illegal,
> can't
> >> be
> >> > >> > empty")
> >> > >> >     else if (topic.equals(".") || topic.equals(".."))
> >> > >> >       throw new InvalidTopicException("topic name cannot be
> \".\" or
> >> > >> > \"..\"")
> >> > >> >     else if (topic.length > maxNameLength)
> >> > >> >       throw new InvalidTopicException("topic name is illegal,
> can't
> >> be
> >> > >> > longer than " + maxNameLength + " characters")
> >> > >> >
> >> > >> >     rgx.findFirstIn(topic) match {
> >> > >> >       case Some(t) =>
> >> > >> >         if (!t.equals(topic))
> >> > >> >           throw new InvalidTopicException("topic name " + topic
> + "
> >> is
> >> > >> > illegal, contains a character other than ASCII alphanumerics,
> '.',
> >> '_'
> >> > >> and
> >> > >> > '-'")
> >> > >> >       case None => throw new InvalidTopicException("topic name "
> +
> >> > topic
> >> > >> +
> >> > >> > " is illegal,  contains a character other than ASCII
> alphanumerics,
> >> > '.',
> >> > >> > '_' and '-'")
> >> > >> >     }
> >> > >> >   }
> >> > >> > }
> >> > >> >
> >> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> >> > wrote:
> >> > >> >
> >> > >> > > I had to go look this one up again to make sure -
> >> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> > >> > >
> >> > >> > > The only valid character names for topics are alphanumeric,
> >> > underscore,
> >> > >> > and
> >> > >> > > dash. A period is not supposed to be a valid character to use.
> If
> >> > >> you're
> >> > >> > > seeing them, then one of two things have happened:
> >> > >> > >
> >> > >> > > 1) You have topic names that are grandfathered in from before
> that
> >> > >> patch
> >> > >> > > 2) The patch is not working properly and there is somewhere in
> the
> >> > >> broker
> >> > >> > > that the standard is not being enforced.
> >> > >> > >
> >> > >> > > -Todd
> >> > >> > >
> >> > >> > >
> >> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <
> brock@apache.org>
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> > >> gshapira@cloudera.com>
> >> > >> > > > wrote:
> >> > >> > > > > Hi Kafka Fans,
> >> > >> > > > >
> >> > >> > > > > If you have one topic named "kafka_lab_2" and the other
> named
> >> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> >> kafka_lab_2
> >> > >> for
> >> > >> > > > > both, effectively making it impossible to monitor them
> >> properly.
> >> > >> > > > >
> >> > >> > > > > The reason this happens is that using "." in topic names is
> >> > pretty
> >> > >> > > > > common, especially as a way to group topics into data
> centers,
> >> > >> > > > > relevant apps, etc - basically a work-around to our current
> >> > lack of
> >> > >> > > > > name spaces. However, most metric monitoring systems using
> "."
> >> > to
> >> > >> > > > > annotate hierarchy, so to avoid issues around metric names,
> >> > Kafka
> >> > >> > > > > replaces the "." in the name with an underscore.
> >> > >> > > > >
> >> > >> > > > > This generates good metric names, but creates the problem
> with
> >> > name
> >> > >> > > > collisions.
> >> > >> > > > >
> >> > >> > > > > I'm wondering if it makes sense to simply limit the range
> of
> >> > >> > > > > characters permitted in a topic name and disallow "_"?
> >> Obviously
> >> > >> > > > > existing topics will need to remain as is, which is a bit
> >> > awkward.
> >> > >> > > >
> >> > >> > > > Interesting problem! Many if not most users I personally am
> >> aware
> >> > of
> >> > >> > > > use "_" as a separator in topic names. I am sure that many
> users
> >> > >> would
> >> > >> > > > be quite surprised by this limitation. With that said, I am
> sure
> >> > >> > > > they'd transition accordingly.
> >> > >> > > >
> >> > >> > > > >
> >> > >> > > > > If anyone has better backward-compatible solutions to this,
> >> I'm
> >> > all
> >> > >> > > ears
> >> > >> > > > :)
> >> > >> > > > >
> >> > >> > > > > Gwen
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > --
> >> > >> > Grant Henke
> >> > >> > Solutions Consultant | Cloudera
> >> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> >> > linkedin.com/in/granthenke
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Grant Henke
> >> > > Solutions Consultant | Cloudera
> >> > > ghenke@cloudera.com | twitter.com/gchenke |
> linkedin.com/in/granthenke
> >> >
> >>
>

Re: [Discussion] Limitations on topic names

Posted by Gwen Shapira <gs...@cloudera.com>.
I don't think we should break existing topics. Just disallow new
topics going forward.

Agree that having both is horrible, but we should have a solution that
fails when you run "kafka_topics.sh --create", not when you configure
Ganglia.

Gwen

On Fri, Jul 10, 2015 at 1:53 PM, Jay Kreps <ja...@confluent.io> wrote:
> Unfortunately '.' is pretty common too. I agree that it is perverse, but
> people seem to do it. Breaking all the topics with '.' in the name seems
> like it could be worse than combining metrics for people who have a
> 'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
> no?).
>
> Where is our Dean of Compatibility, Ewen, on this?
>
> -Jay
>
> On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com> wrote:
>
>> My selfish point of view is that we do #1, as we use "_" extensively in
>> topic names here :) I also happen to think it's the right choice,
>> specifically because "." has more special meanings, as you noted.
>>
>> -Todd
>>
>>
>> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
>> wrote:
>>
>> > Unintentional side effect from allowing IP addresses in consumer client
>> > IDs :)
>> >
>> > So the question is, what do we do now?
>> >
>> > 1) disallow "."
>> > 2) disallow "_"
>> > 3) find a reversible way to encode "." and "_" that won't break existing
>> > metrics
>> > 4) all of the above?
>> >
>> > btw. it looks like "." and ".." are currently valid. Topic names are
>> > used for directories, right? this sounds like fun :)
>> >
>> > I vote for option #1, although if someone has a good idea for #3 it
>> > will be even better.
>> >
>> > Gwen
>> >
>> >
>> >
>> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
>> wrote:
>> > > Found it was added here:
>> https://issues.apache.org/jira/browse/KAFKA-697
>> > >
>> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
>> wrote:
>> > >
>> > >> This was definitely changed at some point after KAFKA-495. The
>> question
>> > is
>> > >> when and why.
>> > >>
>> > >> Here's the relevant code from that patch:
>> > >>
>> > >> ===================================================================
>> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
>> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
>> > >> @@ -21,24 +21,21 @@
>> > >>  import util.matching.Regex
>> > >>
>> > >>  object Topic {
>> > >> +  val legalChars = "[a-zA-Z0-9_-]"
>> > >>
>> > >>
>> > >>
>> > >> -Todd
>> > >>
>> > >>
>> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
>> > wrote:
>> > >>
>> > >> > kafka.common.Topic shows that currently period is a valid character
>> > and I
>> > >> > have verified I can use kafka-topics.sh to create a new topic with a
>> > >> > period.
>> > >> >
>> > >> >
>> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently
>> > uses
>> > >> > Topic.validate before writing to Zookeeper.
>> > >> >
>> > >> > Should period character support be removed? I was under the same
>> > >> impression
>> > >> > as Gwen, that a period was used by many as a way to "group" topics.
>> > >> >
>> > >> > The code is pasted below since its small:
>> > >> >
>> > >> > object Topic {
>> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> > >> >   private val maxNameLength = 255
>> > >> >   private val rgx = new Regex(legalChars + "+")
>> > >> >
>> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
>> > >> >
>> > >> >   def validate(topic: String) {
>> > >> >     if (topic.length <= 0)
>> > >> >       throw new InvalidTopicException("topic name is illegal, can't
>> be
>> > >> > empty")
>> > >> >     else if (topic.equals(".") || topic.equals(".."))
>> > >> >       throw new InvalidTopicException("topic name cannot be \".\" or
>> > >> > \"..\"")
>> > >> >     else if (topic.length > maxNameLength)
>> > >> >       throw new InvalidTopicException("topic name is illegal, can't
>> be
>> > >> > longer than " + maxNameLength + " characters")
>> > >> >
>> > >> >     rgx.findFirstIn(topic) match {
>> > >> >       case Some(t) =>
>> > >> >         if (!t.equals(topic))
>> > >> >           throw new InvalidTopicException("topic name " + topic + "
>> is
>> > >> > illegal, contains a character other than ASCII alphanumerics, '.',
>> '_'
>> > >> and
>> > >> > '-'")
>> > >> >       case None => throw new InvalidTopicException("topic name " +
>> > topic
>> > >> +
>> > >> > " is illegal,  contains a character other than ASCII alphanumerics,
>> > '.',
>> > >> > '_' and '-'")
>> > >> >     }
>> > >> >   }
>> > >> > }
>> > >> >
>> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
>> > wrote:
>> > >> >
>> > >> > > I had to go look this one up again to make sure -
>> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
>> > >> > >
>> > >> > > The only valid character names for topics are alphanumeric,
>> > underscore,
>> > >> > and
>> > >> > > dash. A period is not supposed to be a valid character to use. If
>> > >> you're
>> > >> > > seeing them, then one of two things have happened:
>> > >> > >
>> > >> > > 1) You have topic names that are grandfathered in from before that
>> > >> patch
>> > >> > > 2) The patch is not working properly and there is somewhere in the
>> > >> broker
>> > >> > > that the standard is not being enforced.
>> > >> > >
>> > >> > > -Todd
>> > >> > >
>> > >> > >
>> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
>> > >> wrote:
>> > >> > >
>> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> > >> gshapira@cloudera.com>
>> > >> > > > wrote:
>> > >> > > > > Hi Kafka Fans,
>> > >> > > > >
>> > >> > > > > If you have one topic named "kafka_lab_2" and the other named
>> > >> > > > > "kafka.lab.2", the topic level metrics will be named
>> kafka_lab_2
>> > >> for
>> > >> > > > > both, effectively making it impossible to monitor them
>> properly.
>> > >> > > > >
>> > >> > > > > The reason this happens is that using "." in topic names is
>> > pretty
>> > >> > > > > common, especially as a way to group topics into data centers,
>> > >> > > > > relevant apps, etc - basically a work-around to our current
>> > lack of
>> > >> > > > > name spaces. However, most metric monitoring systems using "."
>> > to
>> > >> > > > > annotate hierarchy, so to avoid issues around metric names,
>> > Kafka
>> > >> > > > > replaces the "." in the name with an underscore.
>> > >> > > > >
>> > >> > > > > This generates good metric names, but creates the problem with
>> > name
>> > >> > > > collisions.
>> > >> > > > >
>> > >> > > > > I'm wondering if it makes sense to simply limit the range of
>> > >> > > > > characters permitted in a topic name and disallow "_"?
>> Obviously
>> > >> > > > > existing topics will need to remain as is, which is a bit
>> > awkward.
>> > >> > > >
>> > >> > > > Interesting problem! Many if not most users I personally am
>> aware
>> > of
>> > >> > > > use "_" as a separator in topic names. I am sure that many users
>> > >> would
>> > >> > > > be quite surprised by this limitation. With that said, I am sure
>> > >> > > > they'd transition accordingly.
>> > >> > > >
>> > >> > > > >
>> > >> > > > > If anyone has better backward-compatible solutions to this,
>> I'm
>> > all
>> > >> > > ears
>> > >> > > > :)
>> > >> > > > >
>> > >> > > > > Gwen
>> > >> > > >
>> > >> > >
>> > >> >
>> > >> >
>> > >> >
>> > >> > --
>> > >> > Grant Henke
>> > >> > Solutions Consultant | Cloudera
>> > >> > ghenke@cloudera.com | twitter.com/gchenke |
>> > linkedin.com/in/granthenke
>> > >> >
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Grant Henke
>> > > Solutions Consultant | Cloudera
>> > > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>> >
>>

Re: [Discussion] Limitations on topic names

Posted by Jay Kreps <ja...@confluent.io>.
Unfortunately '.' is pretty common too. I agree that it is perverse, but
people seem to do it. Breaking all the topics with '.' in the name seems
like it could be worse than combining metrics for people who have a
'foo_bar' AND 'foo.bar' (and after all, having both is DEEPLY perverse,
no?).

Where is our Dean of Compatibility, Ewen, on this?

-Jay

On Fri, Jul 10, 2015 at 1:32 PM, Todd Palino <tp...@gmail.com> wrote:

> My selfish point of view is that we do #1, as we use "_" extensively in
> topic names here :) I also happen to think it's the right choice,
> specifically because "." has more special meanings, as you noted.
>
> -Todd
>
>
> On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > Unintentional side effect from allowing IP addresses in consumer client
> > IDs :)
> >
> > So the question is, what do we do now?
> >
> > 1) disallow "."
> > 2) disallow "_"
> > 3) find a reversible way to encode "." and "_" that won't break existing
> > metrics
> > 4) all of the above?
> >
> > btw. it looks like "." and ".." are currently valid. Topic names are
> > used for directories, right? this sounds like fun :)
> >
> > I vote for option #1, although if someone has a good idea for #3 it
> > will be even better.
> >
> > Gwen
> >
> >
> >
> > On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com>
> wrote:
> > > Found it was added here:
> https://issues.apache.org/jira/browse/KAFKA-697
> > >
> > > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >
> > >> This was definitely changed at some point after KAFKA-495. The
> question
> > is
> > >> when and why.
> > >>
> > >> Here's the relevant code from that patch:
> > >>
> > >> ===================================================================
> > >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> > >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> > >> @@ -21,24 +21,21 @@
> > >>  import util.matching.Regex
> > >>
> > >>  object Topic {
> > >> +  val legalChars = "[a-zA-Z0-9_-]"
> > >>
> > >>
> > >>
> > >> -Todd
> > >>
> > >>
> > >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> > wrote:
> > >>
> > >> > kafka.common.Topic shows that currently period is a valid character
> > and I
> > >> > have verified I can use kafka-topics.sh to create a new topic with a
> > >> > period.
> > >> >
> > >> >
> > >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently
> > uses
> > >> > Topic.validate before writing to Zookeeper.
> > >> >
> > >> > Should period character support be removed? I was under the same
> > >> impression
> > >> > as Gwen, that a period was used by many as a way to "group" topics.
> > >> >
> > >> > The code is pasted below since its small:
> > >> >
> > >> > object Topic {
> > >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >> >   private val maxNameLength = 255
> > >> >   private val rgx = new Regex(legalChars + "+")
> > >> >
> > >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> > >> >
> > >> >   def validate(topic: String) {
> > >> >     if (topic.length <= 0)
> > >> >       throw new InvalidTopicException("topic name is illegal, can't
> be
> > >> > empty")
> > >> >     else if (topic.equals(".") || topic.equals(".."))
> > >> >       throw new InvalidTopicException("topic name cannot be \".\" or
> > >> > \"..\"")
> > >> >     else if (topic.length > maxNameLength)
> > >> >       throw new InvalidTopicException("topic name is illegal, can't
> be
> > >> > longer than " + maxNameLength + " characters")
> > >> >
> > >> >     rgx.findFirstIn(topic) match {
> > >> >       case Some(t) =>
> > >> >         if (!t.equals(topic))
> > >> >           throw new InvalidTopicException("topic name " + topic + "
> is
> > >> > illegal, contains a character other than ASCII alphanumerics, '.',
> '_'
> > >> and
> > >> > '-'")
> > >> >       case None => throw new InvalidTopicException("topic name " +
> > topic
> > >> +
> > >> > " is illegal,  contains a character other than ASCII alphanumerics,
> > '.',
> > >> > '_' and '-'")
> > >> >     }
> > >> >   }
> > >> > }
> > >> >
> > >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> > wrote:
> > >> >
> > >> > > I had to go look this one up again to make sure -
> > >> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >> > >
> > >> > > The only valid character names for topics are alphanumeric,
> > underscore,
> > >> > and
> > >> > > dash. A period is not supposed to be a valid character to use. If
> > >> you're
> > >> > > seeing them, then one of two things have happened:
> > >> > >
> > >> > > 1) You have topic names that are grandfathered in from before that
> > >> patch
> > >> > > 2) The patch is not working properly and there is somewhere in the
> > >> broker
> > >> > > that the standard is not being enforced.
> > >> > >
> > >> > > -Todd
> > >> > >
> > >> > >
> > >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
> > >> wrote:
> > >> > >
> > >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > >> gshapira@cloudera.com>
> > >> > > > wrote:
> > >> > > > > Hi Kafka Fans,
> > >> > > > >
> > >> > > > > If you have one topic named "kafka_lab_2" and the other named
> > >> > > > > "kafka.lab.2", the topic level metrics will be named
> kafka_lab_2
> > >> for
> > >> > > > > both, effectively making it impossible to monitor them
> properly.
> > >> > > > >
> > >> > > > > The reason this happens is that using "." in topic names is
> > pretty
> > >> > > > > common, especially as a way to group topics into data centers,
> > >> > > > > relevant apps, etc - basically a work-around to our current
> > lack of
> > >> > > > > name spaces. However, most metric monitoring systems using "."
> > to
> > >> > > > > annotate hierarchy, so to avoid issues around metric names,
> > Kafka
> > >> > > > > replaces the "." in the name with an underscore.
> > >> > > > >
> > >> > > > > This generates good metric names, but creates the problem with
> > name
> > >> > > > collisions.
> > >> > > > >
> > >> > > > > I'm wondering if it makes sense to simply limit the range of
> > >> > > > > characters permitted in a topic name and disallow "_"?
> Obviously
> > >> > > > > existing topics will need to remain as is, which is a bit
> > awkward.
> > >> > > >
> > >> > > > Interesting problem! Many if not most users I personally am
> aware
> > of
> > >> > > > use "_" as a separator in topic names. I am sure that many users
> > >> would
> > >> > > > be quite surprised by this limitation. With that said, I am sure
> > >> > > > they'd transition accordingly.
> > >> > > >
> > >> > > > >
> > >> > > > > If anyone has better backward-compatible solutions to this,
> I'm
> > all
> > >> > > ears
> > >> > > > :)
> > >> > > > >
> > >> > > > > Gwen
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Grant Henke
> > >> > Solutions Consultant | Cloudera
> > >> > ghenke@cloudera.com | twitter.com/gchenke |
> > linkedin.com/in/granthenke
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Grant Henke
> > > Solutions Consultant | Cloudera
> > > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
My selfish point of view is that we do #1, as we use "_" extensively in
topic names here :) I also happen to think it's the right choice,
specifically because "." has more special meanings, as you noted.

-Todd


On Fri, Jul 10, 2015 at 1:30 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> Unintentional side effect from allowing IP addresses in consumer client
> IDs :)
>
> So the question is, what do we do now?
>
> 1) disallow "."
> 2) disallow "_"
> 3) find a reversible way to encode "." and "_" that won't break existing
> metrics
> 4) all of the above?
>
> btw. it looks like "." and ".." are currently valid. Topic names are
> used for directories, right? this sounds like fun :)
>
> I vote for option #1, although if someone has a good idea for #3 it
> will be even better.
>
> Gwen
>
>
>
> On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com> wrote:
> > Found it was added here: https://issues.apache.org/jira/browse/KAFKA-697
> >
> > On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com> wrote:
> >
> >> This was definitely changed at some point after KAFKA-495. The question
> is
> >> when and why.
> >>
> >> Here's the relevant code from that patch:
> >>
> >> ===================================================================
> >> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> >> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> >> @@ -21,24 +21,21 @@
> >>  import util.matching.Regex
> >>
> >>  object Topic {
> >> +  val legalChars = "[a-zA-Z0-9_-]"
> >>
> >>
> >>
> >> -Todd
> >>
> >>
> >> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> wrote:
> >>
> >> > kafka.common.Topic shows that currently period is a valid character
> and I
> >> > have verified I can use kafka-topics.sh to create a new topic with a
> >> > period.
> >> >
> >> >
> >> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently
> uses
> >> > Topic.validate before writing to Zookeeper.
> >> >
> >> > Should period character support be removed? I was under the same
> >> impression
> >> > as Gwen, that a period was used by many as a way to "group" topics.
> >> >
> >> > The code is pasted below since its small:
> >> >
> >> > object Topic {
> >> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >> >   private val maxNameLength = 255
> >> >   private val rgx = new Regex(legalChars + "+")
> >> >
> >> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> >> >
> >> >   def validate(topic: String) {
> >> >     if (topic.length <= 0)
> >> >       throw new InvalidTopicException("topic name is illegal, can't be
> >> > empty")
> >> >     else if (topic.equals(".") || topic.equals(".."))
> >> >       throw new InvalidTopicException("topic name cannot be \".\" or
> >> > \"..\"")
> >> >     else if (topic.length > maxNameLength)
> >> >       throw new InvalidTopicException("topic name is illegal, can't be
> >> > longer than " + maxNameLength + " characters")
> >> >
> >> >     rgx.findFirstIn(topic) match {
> >> >       case Some(t) =>
> >> >         if (!t.equals(topic))
> >> >           throw new InvalidTopicException("topic name " + topic + " is
> >> > illegal, contains a character other than ASCII alphanumerics, '.', '_'
> >> and
> >> > '-'")
> >> >       case None => throw new InvalidTopicException("topic name " +
> topic
> >> +
> >> > " is illegal,  contains a character other than ASCII alphanumerics,
> '.',
> >> > '_' and '-'")
> >> >     }
> >> >   }
> >> > }
> >> >
> >> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> wrote:
> >> >
> >> > > I had to go look this one up again to make sure -
> >> > > https://issues.apache.org/jira/browse/KAFKA-495
> >> > >
> >> > > The only valid character names for topics are alphanumeric,
> underscore,
> >> > and
> >> > > dash. A period is not supposed to be a valid character to use. If
> >> you're
> >> > > seeing them, then one of two things have happened:
> >> > >
> >> > > 1) You have topic names that are grandfathered in from before that
> >> patch
> >> > > 2) The patch is not working properly and there is somewhere in the
> >> broker
> >> > > that the standard is not being enforced.
> >> > >
> >> > > -Todd
> >> > >
> >> > >
> >> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
> >> wrote:
> >> > >
> >> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> >> gshapira@cloudera.com>
> >> > > > wrote:
> >> > > > > Hi Kafka Fans,
> >> > > > >
> >> > > > > If you have one topic named "kafka_lab_2" and the other named
> >> > > > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2
> >> for
> >> > > > > both, effectively making it impossible to monitor them properly.
> >> > > > >
> >> > > > > The reason this happens is that using "." in topic names is
> pretty
> >> > > > > common, especially as a way to group topics into data centers,
> >> > > > > relevant apps, etc - basically a work-around to our current
> lack of
> >> > > > > name spaces. However, most metric monitoring systems using "."
> to
> >> > > > > annotate hierarchy, so to avoid issues around metric names,
> Kafka
> >> > > > > replaces the "." in the name with an underscore.
> >> > > > >
> >> > > > > This generates good metric names, but creates the problem with
> name
> >> > > > collisions.
> >> > > > >
> >> > > > > I'm wondering if it makes sense to simply limit the range of
> >> > > > > characters permitted in a topic name and disallow "_"? Obviously
> >> > > > > existing topics will need to remain as is, which is a bit
> awkward.
> >> > > >
> >> > > > Interesting problem! Many if not most users I personally am aware
> of
> >> > > > use "_" as a separator in topic names. I am sure that many users
> >> would
> >> > > > be quite surprised by this limitation. With that said, I am sure
> >> > > > they'd transition accordingly.
> >> > > >
> >> > > > >
> >> > > > > If anyone has better backward-compatible solutions to this, I'm
> all
> >> > > ears
> >> > > > :)
> >> > > > >
> >> > > > > Gwen
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Grant Henke
> >> > Solutions Consultant | Cloudera
> >> > ghenke@cloudera.com | twitter.com/gchenke |
> linkedin.com/in/granthenke
> >> >
> >>
> >
> >
> >
> > --
> > Grant Henke
> > Solutions Consultant | Cloudera
> > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>

Re: [Discussion] Limitations on topic names

Posted by Gwen Shapira <gs...@cloudera.com>.
Unintentional side effect from allowing IP addresses in consumer client IDs :)

So the question is, what do we do now?

1) disallow "."
2) disallow "_"
3) find a reversible way to encode "." and "_" that won't break existing metrics
4) all of the above?

btw. it looks like "." and ".." are currently valid. Topic names are
used for directories, right? this sounds like fun :)

I vote for option #1, although if someone has a good idea for #3 it
will be even better.

Gwen



On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com> wrote:
> Found it was added here: https://issues.apache.org/jira/browse/KAFKA-697
>
> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com> wrote:
>
>> This was definitely changed at some point after KAFKA-495. The question is
>> when and why.
>>
>> Here's the relevant code from that patch:
>>
>> ===================================================================
>> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
>> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
>> @@ -21,24 +21,21 @@
>>  import util.matching.Regex
>>
>>  object Topic {
>> +  val legalChars = "[a-zA-Z0-9_-]"
>>
>>
>>
>> -Todd
>>
>>
>> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com> wrote:
>>
>> > kafka.common.Topic shows that currently period is a valid character and I
>> > have verified I can use kafka-topics.sh to create a new topic with a
>> > period.
>> >
>> >
>> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently uses
>> > Topic.validate before writing to Zookeeper.
>> >
>> > Should period character support be removed? I was under the same
>> impression
>> > as Gwen, that a period was used by many as a way to "group" topics.
>> >
>> > The code is pasted below since its small:
>> >
>> > object Topic {
>> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
>> >   private val maxNameLength = 255
>> >   private val rgx = new Regex(legalChars + "+")
>> >
>> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
>> >
>> >   def validate(topic: String) {
>> >     if (topic.length <= 0)
>> >       throw new InvalidTopicException("topic name is illegal, can't be
>> > empty")
>> >     else if (topic.equals(".") || topic.equals(".."))
>> >       throw new InvalidTopicException("topic name cannot be \".\" or
>> > \"..\"")
>> >     else if (topic.length > maxNameLength)
>> >       throw new InvalidTopicException("topic name is illegal, can't be
>> > longer than " + maxNameLength + " characters")
>> >
>> >     rgx.findFirstIn(topic) match {
>> >       case Some(t) =>
>> >         if (!t.equals(topic))
>> >           throw new InvalidTopicException("topic name " + topic + " is
>> > illegal, contains a character other than ASCII alphanumerics, '.', '_'
>> and
>> > '-'")
>> >       case None => throw new InvalidTopicException("topic name " + topic
>> +
>> > " is illegal,  contains a character other than ASCII alphanumerics, '.',
>> > '_' and '-'")
>> >     }
>> >   }
>> > }
>> >
>> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com> wrote:
>> >
>> > > I had to go look this one up again to make sure -
>> > > https://issues.apache.org/jira/browse/KAFKA-495
>> > >
>> > > The only valid character names for topics are alphanumeric, underscore,
>> > and
>> > > dash. A period is not supposed to be a valid character to use. If
>> you're
>> > > seeing them, then one of two things have happened:
>> > >
>> > > 1) You have topic names that are grandfathered in from before that
>> patch
>> > > 2) The patch is not working properly and there is somewhere in the
>> broker
>> > > that the standard is not being enforced.
>> > >
>> > > -Todd
>> > >
>> > >
>> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
>> wrote:
>> > >
>> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
>> gshapira@cloudera.com>
>> > > > wrote:
>> > > > > Hi Kafka Fans,
>> > > > >
>> > > > > If you have one topic named "kafka_lab_2" and the other named
>> > > > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2
>> for
>> > > > > both, effectively making it impossible to monitor them properly.
>> > > > >
>> > > > > The reason this happens is that using "." in topic names is pretty
>> > > > > common, especially as a way to group topics into data centers,
>> > > > > relevant apps, etc - basically a work-around to our current lack of
>> > > > > name spaces. However, most metric monitoring systems using "." to
>> > > > > annotate hierarchy, so to avoid issues around metric names, Kafka
>> > > > > replaces the "." in the name with an underscore.
>> > > > >
>> > > > > This generates good metric names, but creates the problem with name
>> > > > collisions.
>> > > > >
>> > > > > I'm wondering if it makes sense to simply limit the range of
>> > > > > characters permitted in a topic name and disallow "_"? Obviously
>> > > > > existing topics will need to remain as is, which is a bit awkward.
>> > > >
>> > > > Interesting problem! Many if not most users I personally am aware of
>> > > > use "_" as a separator in topic names. I am sure that many users
>> would
>> > > > be quite surprised by this limitation. With that said, I am sure
>> > > > they'd transition accordingly.
>> > > >
>> > > > >
>> > > > > If anyone has better backward-compatible solutions to this, I'm all
>> > > ears
>> > > > :)
>> > > > >
>> > > > > Gwen
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Grant Henke
>> > Solutions Consultant | Cloudera
>> > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>> >
>>
>
>
>
> --
> Grant Henke
> Solutions Consultant | Cloudera
> ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
Thanks, Grant. That seems like a bad solution to the problem that John ran
into in that ticket. It's entirely reasonable to have separate validators
for separate things, but it seems like the choice was made to try and mash
it all into a single validator. And it appears that despite the commentary
in the ticket at the time, Gwen's identified a very good reason to be
restrictive about topic naming.

-Todd



On Fri, Jul 10, 2015 at 1:22 PM, Grant Henke <gh...@cloudera.com> wrote:

> Found it was added here: https://issues.apache.org/jira/browse/KAFKA-697
>
> On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com> wrote:
>
> > This was definitely changed at some point after KAFKA-495. The question
> is
> > when and why.
> >
> > Here's the relevant code from that patch:
> >
> > ===================================================================
> > --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> > +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> > @@ -21,24 +21,21 @@
> >  import util.matching.Regex
> >
> >  object Topic {
> > +  val legalChars = "[a-zA-Z0-9_-]"
> >
> >
> >
> > -Todd
> >
> >
> > On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com>
> wrote:
> >
> > > kafka.common.Topic shows that currently period is a valid character
> and I
> > > have verified I can use kafka-topics.sh to create a new topic with a
> > > period.
> > >
> > >
> > > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently
> uses
> > > Topic.validate before writing to Zookeeper.
> > >
> > > Should period character support be removed? I was under the same
> > impression
> > > as Gwen, that a period was used by many as a way to "group" topics.
> > >
> > > The code is pasted below since its small:
> > >
> > > object Topic {
> > >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> > >   private val maxNameLength = 255
> > >   private val rgx = new Regex(legalChars + "+")
> > >
> > >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> > >
> > >   def validate(topic: String) {
> > >     if (topic.length <= 0)
> > >       throw new InvalidTopicException("topic name is illegal, can't be
> > > empty")
> > >     else if (topic.equals(".") || topic.equals(".."))
> > >       throw new InvalidTopicException("topic name cannot be \".\" or
> > > \"..\"")
> > >     else if (topic.length > maxNameLength)
> > >       throw new InvalidTopicException("topic name is illegal, can't be
> > > longer than " + maxNameLength + " characters")
> > >
> > >     rgx.findFirstIn(topic) match {
> > >       case Some(t) =>
> > >         if (!t.equals(topic))
> > >           throw new InvalidTopicException("topic name " + topic + " is
> > > illegal, contains a character other than ASCII alphanumerics, '.', '_'
> > and
> > > '-'")
> > >       case None => throw new InvalidTopicException("topic name " +
> topic
> > +
> > > " is illegal,  contains a character other than ASCII alphanumerics,
> '.',
> > > '_' and '-'")
> > >     }
> > >   }
> > > }
> > >
> > > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com>
> wrote:
> > >
> > > > I had to go look this one up again to make sure -
> > > > https://issues.apache.org/jira/browse/KAFKA-495
> > > >
> > > > The only valid character names for topics are alphanumeric,
> underscore,
> > > and
> > > > dash. A period is not supposed to be a valid character to use. If
> > you're
> > > > seeing them, then one of two things have happened:
> > > >
> > > > 1) You have topic names that are grandfathered in from before that
> > patch
> > > > 2) The patch is not working properly and there is somewhere in the
> > broker
> > > > that the standard is not being enforced.
> > > >
> > > > -Todd
> > > >
> > > >
> > > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
> > wrote:
> > > >
> > > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> > gshapira@cloudera.com>
> > > > > wrote:
> > > > > > Hi Kafka Fans,
> > > > > >
> > > > > > If you have one topic named "kafka_lab_2" and the other named
> > > > > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2
> > for
> > > > > > both, effectively making it impossible to monitor them properly.
> > > > > >
> > > > > > The reason this happens is that using "." in topic names is
> pretty
> > > > > > common, especially as a way to group topics into data centers,
> > > > > > relevant apps, etc - basically a work-around to our current lack
> of
> > > > > > name spaces. However, most metric monitoring systems using "." to
> > > > > > annotate hierarchy, so to avoid issues around metric names, Kafka
> > > > > > replaces the "." in the name with an underscore.
> > > > > >
> > > > > > This generates good metric names, but creates the problem with
> name
> > > > > collisions.
> > > > > >
> > > > > > I'm wondering if it makes sense to simply limit the range of
> > > > > > characters permitted in a topic name and disallow "_"? Obviously
> > > > > > existing topics will need to remain as is, which is a bit
> awkward.
> > > > >
> > > > > Interesting problem! Many if not most users I personally am aware
> of
> > > > > use "_" as a separator in topic names. I am sure that many users
> > would
> > > > > be quite surprised by this limitation. With that said, I am sure
> > > > > they'd transition accordingly.
> > > > >
> > > > > >
> > > > > > If anyone has better backward-compatible solutions to this, I'm
> all
> > > > ears
> > > > > :)
> > > > > >
> > > > > > Gwen
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Grant Henke
> > > Solutions Consultant | Cloudera
> > > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> > >
> >
>
>
>
> --
> Grant Henke
> Solutions Consultant | Cloudera
> ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>

Re: [Discussion] Limitations on topic names

Posted by Grant Henke <gh...@cloudera.com>.
Found it was added here: https://issues.apache.org/jira/browse/KAFKA-697

On Fri, Jul 10, 2015 at 3:18 PM, Todd Palino <tp...@gmail.com> wrote:

> This was definitely changed at some point after KAFKA-495. The question is
> when and why.
>
> Here's the relevant code from that patch:
>
> ===================================================================
> --- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
> +++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
> @@ -21,24 +21,21 @@
>  import util.matching.Regex
>
>  object Topic {
> +  val legalChars = "[a-zA-Z0-9_-]"
>
>
>
> -Todd
>
>
> On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com> wrote:
>
> > kafka.common.Topic shows that currently period is a valid character and I
> > have verified I can use kafka-topics.sh to create a new topic with a
> > period.
> >
> >
> > AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently uses
> > Topic.validate before writing to Zookeeper.
> >
> > Should period character support be removed? I was under the same
> impression
> > as Gwen, that a period was used by many as a way to "group" topics.
> >
> > The code is pasted below since its small:
> >
> > object Topic {
> >   val legalChars = "[a-zA-Z0-9\\._\\-]"
> >   private val maxNameLength = 255
> >   private val rgx = new Regex(legalChars + "+")
> >
> >   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
> >
> >   def validate(topic: String) {
> >     if (topic.length <= 0)
> >       throw new InvalidTopicException("topic name is illegal, can't be
> > empty")
> >     else if (topic.equals(".") || topic.equals(".."))
> >       throw new InvalidTopicException("topic name cannot be \".\" or
> > \"..\"")
> >     else if (topic.length > maxNameLength)
> >       throw new InvalidTopicException("topic name is illegal, can't be
> > longer than " + maxNameLength + " characters")
> >
> >     rgx.findFirstIn(topic) match {
> >       case Some(t) =>
> >         if (!t.equals(topic))
> >           throw new InvalidTopicException("topic name " + topic + " is
> > illegal, contains a character other than ASCII alphanumerics, '.', '_'
> and
> > '-'")
> >       case None => throw new InvalidTopicException("topic name " + topic
> +
> > " is illegal,  contains a character other than ASCII alphanumerics, '.',
> > '_' and '-'")
> >     }
> >   }
> > }
> >
> > On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com> wrote:
> >
> > > I had to go look this one up again to make sure -
> > > https://issues.apache.org/jira/browse/KAFKA-495
> > >
> > > The only valid character names for topics are alphanumeric, underscore,
> > and
> > > dash. A period is not supposed to be a valid character to use. If
> you're
> > > seeing them, then one of two things have happened:
> > >
> > > 1) You have topic names that are grandfathered in from before that
> patch
> > > 2) The patch is not working properly and there is somewhere in the
> broker
> > > that the standard is not being enforced.
> > >
> > > -Todd
> > >
> > >
> > > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org>
> wrote:
> > >
> > > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <
> gshapira@cloudera.com>
> > > > wrote:
> > > > > Hi Kafka Fans,
> > > > >
> > > > > If you have one topic named "kafka_lab_2" and the other named
> > > > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2
> for
> > > > > both, effectively making it impossible to monitor them properly.
> > > > >
> > > > > The reason this happens is that using "." in topic names is pretty
> > > > > common, especially as a way to group topics into data centers,
> > > > > relevant apps, etc - basically a work-around to our current lack of
> > > > > name spaces. However, most metric monitoring systems using "." to
> > > > > annotate hierarchy, so to avoid issues around metric names, Kafka
> > > > > replaces the "." in the name with an underscore.
> > > > >
> > > > > This generates good metric names, but creates the problem with name
> > > > collisions.
> > > > >
> > > > > I'm wondering if it makes sense to simply limit the range of
> > > > > characters permitted in a topic name and disallow "_"? Obviously
> > > > > existing topics will need to remain as is, which is a bit awkward.
> > > >
> > > > Interesting problem! Many if not most users I personally am aware of
> > > > use "_" as a separator in topic names. I am sure that many users
> would
> > > > be quite surprised by this limitation. With that said, I am sure
> > > > they'd transition accordingly.
> > > >
> > > > >
> > > > > If anyone has better backward-compatible solutions to this, I'm all
> > > ears
> > > > :)
> > > > >
> > > > > Gwen
> > > >
> > >
> >
> >
> >
> > --
> > Grant Henke
> > Solutions Consultant | Cloudera
> > ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>



-- 
Grant Henke
Solutions Consultant | Cloudera
ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
This was definitely changed at some point after KAFKA-495. The question is
when and why.

Here's the relevant code from that patch:

===================================================================
--- core/src/main/scala/kafka/utils/Topic.scala (revision 1390178)
+++ core/src/main/scala/kafka/utils/Topic.scala (working copy)
@@ -21,24 +21,21 @@
 import util.matching.Regex

 object Topic {
+  val legalChars = "[a-zA-Z0-9_-]"



-Todd


On Fri, Jul 10, 2015 at 1:02 PM, Grant Henke <gh...@cloudera.com> wrote:

> kafka.common.Topic shows that currently period is a valid character and I
> have verified I can use kafka-topics.sh to create a new topic with a
> period.
>
>
> AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently uses
> Topic.validate before writing to Zookeeper.
>
> Should period character support be removed? I was under the same impression
> as Gwen, that a period was used by many as a way to "group" topics.
>
> The code is pasted below since its small:
>
> object Topic {
>   val legalChars = "[a-zA-Z0-9\\._\\-]"
>   private val maxNameLength = 255
>   private val rgx = new Regex(legalChars + "+")
>
>   val InternalTopics = Set(OffsetManager.OffsetsTopicName)
>
>   def validate(topic: String) {
>     if (topic.length <= 0)
>       throw new InvalidTopicException("topic name is illegal, can't be
> empty")
>     else if (topic.equals(".") || topic.equals(".."))
>       throw new InvalidTopicException("topic name cannot be \".\" or
> \"..\"")
>     else if (topic.length > maxNameLength)
>       throw new InvalidTopicException("topic name is illegal, can't be
> longer than " + maxNameLength + " characters")
>
>     rgx.findFirstIn(topic) match {
>       case Some(t) =>
>         if (!t.equals(topic))
>           throw new InvalidTopicException("topic name " + topic + " is
> illegal, contains a character other than ASCII alphanumerics, '.', '_' and
> '-'")
>       case None => throw new InvalidTopicException("topic name " + topic +
> " is illegal,  contains a character other than ASCII alphanumerics, '.',
> '_' and '-'")
>     }
>   }
> }
>
> On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com> wrote:
>
> > I had to go look this one up again to make sure -
> > https://issues.apache.org/jira/browse/KAFKA-495
> >
> > The only valid character names for topics are alphanumeric, underscore,
> and
> > dash. A period is not supposed to be a valid character to use. If you're
> > seeing them, then one of two things have happened:
> >
> > 1) You have topic names that are grandfathered in from before that patch
> > 2) The patch is not working properly and there is somewhere in the broker
> > that the standard is not being enforced.
> >
> > -Todd
> >
> >
> > On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org> wrote:
> >
> > > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <gs...@cloudera.com>
> > > wrote:
> > > > Hi Kafka Fans,
> > > >
> > > > If you have one topic named "kafka_lab_2" and the other named
> > > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2 for
> > > > both, effectively making it impossible to monitor them properly.
> > > >
> > > > The reason this happens is that using "." in topic names is pretty
> > > > common, especially as a way to group topics into data centers,
> > > > relevant apps, etc - basically a work-around to our current lack of
> > > > name spaces. However, most metric monitoring systems using "." to
> > > > annotate hierarchy, so to avoid issues around metric names, Kafka
> > > > replaces the "." in the name with an underscore.
> > > >
> > > > This generates good metric names, but creates the problem with name
> > > collisions.
> > > >
> > > > I'm wondering if it makes sense to simply limit the range of
> > > > characters permitted in a topic name and disallow "_"? Obviously
> > > > existing topics will need to remain as is, which is a bit awkward.
> > >
> > > Interesting problem! Many if not most users I personally am aware of
> > > use "_" as a separator in topic names. I am sure that many users would
> > > be quite surprised by this limitation. With that said, I am sure
> > > they'd transition accordingly.
> > >
> > > >
> > > > If anyone has better backward-compatible solutions to this, I'm all
> > ears
> > > :)
> > > >
> > > > Gwen
> > >
> >
>
>
>
> --
> Grant Henke
> Solutions Consultant | Cloudera
> ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>

Re: [Discussion] Limitations on topic names

Posted by Grant Henke <gh...@cloudera.com>.
kafka.common.Topic shows that currently period is a valid character and I
have verified I can use kafka-topics.sh to create a new topic with a period.


AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK currently uses
Topic.validate before writing to Zookeeper.

Should period character support be removed? I was under the same impression
as Gwen, that a period was used by many as a way to "group" topics.

The code is pasted below since its small:

object Topic {
  val legalChars = "[a-zA-Z0-9\\._\\-]"
  private val maxNameLength = 255
  private val rgx = new Regex(legalChars + "+")

  val InternalTopics = Set(OffsetManager.OffsetsTopicName)

  def validate(topic: String) {
    if (topic.length <= 0)
      throw new InvalidTopicException("topic name is illegal, can't be
empty")
    else if (topic.equals(".") || topic.equals(".."))
      throw new InvalidTopicException("topic name cannot be \".\" or
\"..\"")
    else if (topic.length > maxNameLength)
      throw new InvalidTopicException("topic name is illegal, can't be
longer than " + maxNameLength + " characters")

    rgx.findFirstIn(topic) match {
      case Some(t) =>
        if (!t.equals(topic))
          throw new InvalidTopicException("topic name " + topic + " is
illegal, contains a character other than ASCII alphanumerics, '.', '_' and
'-'")
      case None => throw new InvalidTopicException("topic name " + topic +
" is illegal,  contains a character other than ASCII alphanumerics, '.',
'_' and '-'")
    }
  }
}

On Fri, Jul 10, 2015 at 2:50 PM, Todd Palino <tp...@gmail.com> wrote:

> I had to go look this one up again to make sure -
> https://issues.apache.org/jira/browse/KAFKA-495
>
> The only valid character names for topics are alphanumeric, underscore, and
> dash. A period is not supposed to be a valid character to use. If you're
> seeing them, then one of two things have happened:
>
> 1) You have topic names that are grandfathered in from before that patch
> 2) The patch is not working properly and there is somewhere in the broker
> that the standard is not being enforced.
>
> -Todd
>
>
> On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org> wrote:
>
> > On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <gs...@cloudera.com>
> > wrote:
> > > Hi Kafka Fans,
> > >
> > > If you have one topic named "kafka_lab_2" and the other named
> > > "kafka.lab.2", the topic level metrics will be named kafka_lab_2 for
> > > both, effectively making it impossible to monitor them properly.
> > >
> > > The reason this happens is that using "." in topic names is pretty
> > > common, especially as a way to group topics into data centers,
> > > relevant apps, etc - basically a work-around to our current lack of
> > > name spaces. However, most metric monitoring systems using "." to
> > > annotate hierarchy, so to avoid issues around metric names, Kafka
> > > replaces the "." in the name with an underscore.
> > >
> > > This generates good metric names, but creates the problem with name
> > collisions.
> > >
> > > I'm wondering if it makes sense to simply limit the range of
> > > characters permitted in a topic name and disallow "_"? Obviously
> > > existing topics will need to remain as is, which is a bit awkward.
> >
> > Interesting problem! Many if not most users I personally am aware of
> > use "_" as a separator in topic names. I am sure that many users would
> > be quite surprised by this limitation. With that said, I am sure
> > they'd transition accordingly.
> >
> > >
> > > If anyone has better backward-compatible solutions to this, I'm all
> ears
> > :)
> > >
> > > Gwen
> >
>



-- 
Grant Henke
Solutions Consultant | Cloudera
ghenke@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [Discussion] Limitations on topic names

Posted by Todd Palino <tp...@gmail.com>.
I had to go look this one up again to make sure -
https://issues.apache.org/jira/browse/KAFKA-495

The only valid character names for topics are alphanumeric, underscore, and
dash. A period is not supposed to be a valid character to use. If you're
seeing them, then one of two things have happened:

1) You have topic names that are grandfathered in from before that patch
2) The patch is not working properly and there is somewhere in the broker
that the standard is not being enforced.

-Todd


On Fri, Jul 10, 2015 at 12:13 PM, Brock Noland <br...@apache.org> wrote:

> On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <gs...@cloudera.com>
> wrote:
> > Hi Kafka Fans,
> >
> > If you have one topic named "kafka_lab_2" and the other named
> > "kafka.lab.2", the topic level metrics will be named kafka_lab_2 for
> > both, effectively making it impossible to monitor them properly.
> >
> > The reason this happens is that using "." in topic names is pretty
> > common, especially as a way to group topics into data centers,
> > relevant apps, etc - basically a work-around to our current lack of
> > name spaces. However, most metric monitoring systems using "." to
> > annotate hierarchy, so to avoid issues around metric names, Kafka
> > replaces the "." in the name with an underscore.
> >
> > This generates good metric names, but creates the problem with name
> collisions.
> >
> > I'm wondering if it makes sense to simply limit the range of
> > characters permitted in a topic name and disallow "_"? Obviously
> > existing topics will need to remain as is, which is a bit awkward.
>
> Interesting problem! Many if not most users I personally am aware of
> use "_" as a separator in topic names. I am sure that many users would
> be quite surprised by this limitation. With that said, I am sure
> they'd transition accordingly.
>
> >
> > If anyone has better backward-compatible solutions to this, I'm all ears
> :)
> >
> > Gwen
>

Re: [Discussion] Limitations on topic names

Posted by Brock Noland <br...@apache.org>.
On Fri, Jul 10, 2015 at 11:34 AM, Gwen Shapira <gs...@cloudera.com> wrote:
> Hi Kafka Fans,
>
> If you have one topic named "kafka_lab_2" and the other named
> "kafka.lab.2", the topic level metrics will be named kafka_lab_2 for
> both, effectively making it impossible to monitor them properly.
>
> The reason this happens is that using "." in topic names is pretty
> common, especially as a way to group topics into data centers,
> relevant apps, etc - basically a work-around to our current lack of
> name spaces. However, most metric monitoring systems using "." to
> annotate hierarchy, so to avoid issues around metric names, Kafka
> replaces the "." in the name with an underscore.
>
> This generates good metric names, but creates the problem with name collisions.
>
> I'm wondering if it makes sense to simply limit the range of
> characters permitted in a topic name and disallow "_"? Obviously
> existing topics will need to remain as is, which is a bit awkward.

Interesting problem! Many if not most users I personally am aware of
use "_" as a separator in topic names. I am sure that many users would
be quite surprised by this limitation. With that said, I am sure
they'd transition accordingly.

>
> If anyone has better backward-compatible solutions to this, I'm all ears :)
>
> Gwen