You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by 刘明敏 <di...@gmail.com> on 2012/06/02 12:07:37 UTC

encoding issue in kafka

we encountered an encoding issue when dealing with Chinese character

the producer send characters in right encode(UTF-8),while after the
consumer get it ,it all turns into question marks:????

when start up producer,kafka broker server and consumer, we tried specified
-Dfile.encoding=UTF-8,but it doesn't work


In producer,we use StringEncoder,below is the snippet of producer:


  val props = new Properties();

  ...

  props.put("serializer.class", "kafka.serializer.StringEncoder");
  props.put("compression.codec", "1") //gzip

  val producerConfig = new ProducerConfig(props);
  val producer = new Producer[String, String](producerConfig);
    val data = new ProducerData[String, String](topic, partitionKey,
List("string_to_send_to_borker"));


  producer.send(data);


and consumer:


    val topicMessageStreams =
consumerConnector.createMessageStreams(Predef.Map(topic -> consumers),
new StringDecoder)

    for ((topic, streamList) <- topicMessageStreams) {
      for (stream <- streamList) {
        val processor = new StreamProcessor(stream)

        new Thread(processor).start();
      }
    }

and the StreamProcessor just iterate each streams
  val message = iterator.next.message//chinese characters in message
turns into ?????


Anyone any help?


-- 
Best Regards

----------------------
刘明敏 | mmLiu

Re: encoding issue in kafka

Posted by Jay Kreps <ja...@gmail.com>.
It is definitely a problem in StringEncoder/StringDecoder. They are weirdly
packaged in with the Encoder.scala/Decoder.scala (maybe should be moved
out). I filed a JIRA for it and marked it with newbie label in case anyone
wants to take a shot:
https://issues.apache.org/jira/browse/KAFKA-367

-Jay

On Mon, Jun 18, 2012 at 6:32 PM, Chris Burroughs
<ch...@gmail.com>wrote:

> I thought there was another thread or ticket about this but I can't find
> it.  Did we ever get to the bottom of this?
>
> On 06/06/2012 05:45 PM, Roman Garcia wrote:
> > I guess having to set the platform encoding shouldn't be the fix, right?
> > If that's the case, then somewhere in the code the default encoding
> > (platform) is being used...that was the StringEncoder I understand.
> > Then again, I believe these were removed from trunk, right? (at least, I
> > can't seem to find em)
> > If they weren't removed (just moved outside), it should be considered to
> > start using the expected String constructor: String(bytes, charset)
> >
> > Regards,
> > Roman
> >
> > 2012/6/6 Patricio Echagüe <pa...@gmail.com>
> >
> >> I ran into the same issue and setting -Dfile.encoding=UTF-8 in the
> startup
> >> script fixed it.
> >>
> >> On Wed, Jun 6, 2012 at 12:18 AM, 刘明敏 <di...@gmail.com>
> wrote:
> >>
> >>> sorry for the late reply, it is my fault
> >>>
> >>> after set -Dfile.encoding=UTF-8  when start up producer
> >>>
> >>> problem solved.
> >>>
> >>> On Sat, Jun 2, 2012 at 6:07 PM, 刘明敏 <di...@gmail.com>
> wrote:
> >>>
> >>>> we encountered an encoding issue when dealing with Chinese character
> >>>>
> >>>> the producer send characters in right encode(UTF-8),while after the
> >>>> consumer get it ,it all turns into question marks:????
> >>>>
> >>>> when start up producer,kafka broker server and consumer, we tried
> >>>> specified -Dfile.encoding=UTF-8,but it doesn't work
> >>>>
> >>>>
> >>>> In producer,we use StringEncoder,below is the snippet of producer:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>   val props = new Properties();
> >>>>
> >>>>
> >>>>
> >>>>   ...
> >>>>
> >>>>   props.put("serializer.class", "kafka.serializer.StringEncoder");
> >>>>
> >>>>
> >>>>   props.put("compression.codec", "1") //gzip
> >>>>
> >>>>
> >>>>
> >>>>   val producerConfig = new ProducerConfig(props);
> >>>>
> >>>>
> >>>>   val producer = new Producer[String, String](producerConfig);
> >>>>
> >>>>
> >>>>     val data = new ProducerData[String, String](topic, partitionKey,
> >>> List("string_to_send_to_borker"));
> >>>>
> >>>>
> >>>>
> >>>>   producer.send(data);
> >>>>
> >>>>
> >>>>
> >>>> and consumer:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>     val topicMessageStreams =
> >>> consumerConnector.createMessageStreams(Predef.Map(topic -> consumers),
> >> new
> >>> StringDecoder)
> >>>>
> >>>>
> >>>>
> >>>>     for ((topic, streamList) <- topicMessageStreams) {
> >>>>
> >>>>
> >>>>       for (stream <- streamList) {
> >>>>
> >>>>
> >>>>         val processor = new StreamProcessor(stream)
> >>>>
> >>>>
> >>>>
> >>>>         new Thread(processor).start();
> >>>>
> >>>>
> >>>>       }
> >>>>
> >>>>
> >>>>     }
> >>>>
> >>>>
> >>>>
> >>>> and the StreamProcessor just iterate each streams
> >>>>
> >>>>
> >>>>   val message = iterator.next.message//chinese characters in message
> >>> turns into ?????
> >>>>
> >>>>
> >>>>
> >>>> Anyone any help?
> >>>>
> >>>>
> >>>> --
> >>>> Best Regards
> >>>>
> >>>> ----------------------
> >>>> 刘明敏 | mmLiu
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best Regards
> >>>
> >>> ----------------------
> >>> 刘明敏 | mmLiu
> >>>
> >>
> >
>
>

Re: encoding issue in kafka

Posted by Chris Burroughs <ch...@gmail.com>.
I thought there was another thread or ticket about this but I can't find
it.  Did we ever get to the bottom of this?

On 06/06/2012 05:45 PM, Roman Garcia wrote:
> I guess having to set the platform encoding shouldn't be the fix, right?
> If that's the case, then somewhere in the code the default encoding
> (platform) is being used...that was the StringEncoder I understand.
> Then again, I believe these were removed from trunk, right? (at least, I
> can't seem to find em)
> If they weren't removed (just moved outside), it should be considered to
> start using the expected String constructor: String(bytes, charset)
> 
> Regards,
> Roman
> 
> 2012/6/6 Patricio Echagüe <pa...@gmail.com>
> 
>> I ran into the same issue and setting -Dfile.encoding=UTF-8 in the startup
>> script fixed it.
>>
>> On Wed, Jun 6, 2012 at 12:18 AM, 刘明敏 <di...@gmail.com> wrote:
>>
>>> sorry for the late reply, it is my fault
>>>
>>> after set -Dfile.encoding=UTF-8  when start up producer
>>>
>>> problem solved.
>>>
>>> On Sat, Jun 2, 2012 at 6:07 PM, 刘明敏 <di...@gmail.com> wrote:
>>>
>>>> we encountered an encoding issue when dealing with Chinese character
>>>>
>>>> the producer send characters in right encode(UTF-8),while after the
>>>> consumer get it ,it all turns into question marks:????
>>>>
>>>> when start up producer,kafka broker server and consumer, we tried
>>>> specified -Dfile.encoding=UTF-8,but it doesn't work
>>>>
>>>>
>>>> In producer,we use StringEncoder,below is the snippet of producer:
>>>>
>>>>
>>>>
>>>>
>>>>   val props = new Properties();
>>>>
>>>>
>>>>
>>>>   ...
>>>>
>>>>   props.put("serializer.class", "kafka.serializer.StringEncoder");
>>>>
>>>>
>>>>   props.put("compression.codec", "1") //gzip
>>>>
>>>>
>>>>
>>>>   val producerConfig = new ProducerConfig(props);
>>>>
>>>>
>>>>   val producer = new Producer[String, String](producerConfig);
>>>>
>>>>
>>>>     val data = new ProducerData[String, String](topic, partitionKey,
>>> List("string_to_send_to_borker"));
>>>>
>>>>
>>>>
>>>>   producer.send(data);
>>>>
>>>>
>>>>
>>>> and consumer:
>>>>
>>>>
>>>>
>>>>
>>>>     val topicMessageStreams =
>>> consumerConnector.createMessageStreams(Predef.Map(topic -> consumers),
>> new
>>> StringDecoder)
>>>>
>>>>
>>>>
>>>>     for ((topic, streamList) <- topicMessageStreams) {
>>>>
>>>>
>>>>       for (stream <- streamList) {
>>>>
>>>>
>>>>         val processor = new StreamProcessor(stream)
>>>>
>>>>
>>>>
>>>>         new Thread(processor).start();
>>>>
>>>>
>>>>       }
>>>>
>>>>
>>>>     }
>>>>
>>>>
>>>>
>>>> and the StreamProcessor just iterate each streams
>>>>
>>>>
>>>>   val message = iterator.next.message//chinese characters in message
>>> turns into ?????
>>>>
>>>>
>>>>
>>>> Anyone any help?
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> ----------------------
>>>> 刘明敏 | mmLiu
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> ----------------------
>>> 刘明敏 | mmLiu
>>>
>>
> 


Re: encoding issue in kafka

Posted by Roman Garcia <ro...@gmail.com>.
I guess having to set the platform encoding shouldn't be the fix, right?
If that's the case, then somewhere in the code the default encoding
(platform) is being used...that was the StringEncoder I understand.
Then again, I believe these were removed from trunk, right? (at least, I
can't seem to find em)
If they weren't removed (just moved outside), it should be considered to
start using the expected String constructor: String(bytes, charset)

Regards,
Roman

2012/6/6 Patricio Echagüe <pa...@gmail.com>

> I ran into the same issue and setting -Dfile.encoding=UTF-8 in the startup
> script fixed it.
>
> On Wed, Jun 6, 2012 at 12:18 AM, 刘明敏 <di...@gmail.com> wrote:
>
> > sorry for the late reply, it is my fault
> >
> > after set -Dfile.encoding=UTF-8  when start up producer
> >
> > problem solved.
> >
> > On Sat, Jun 2, 2012 at 6:07 PM, 刘明敏 <di...@gmail.com> wrote:
> >
> > > we encountered an encoding issue when dealing with Chinese character
> > >
> > > the producer send characters in right encode(UTF-8),while after the
> > > consumer get it ,it all turns into question marks:????
> > >
> > > when start up producer,kafka broker server and consumer, we tried
> > > specified -Dfile.encoding=UTF-8,but it doesn't work
> > >
> > >
> > > In producer,we use StringEncoder,below is the snippet of producer:
> > >
> > >
> > >
> > >
> > >   val props = new Properties();
> > >
> > >
> > >
> > >   ...
> > >
> > >   props.put("serializer.class", "kafka.serializer.StringEncoder");
> > >
> > >
> > >   props.put("compression.codec", "1") //gzip
> > >
> > >
> > >
> > >   val producerConfig = new ProducerConfig(props);
> > >
> > >
> > >   val producer = new Producer[String, String](producerConfig);
> > >
> > >
> > >     val data = new ProducerData[String, String](topic, partitionKey,
> > List("string_to_send_to_borker"));
> > >
> > >
> > >
> > >   producer.send(data);
> > >
> > >
> > >
> > > and consumer:
> > >
> > >
> > >
> > >
> > >     val topicMessageStreams =
> > consumerConnector.createMessageStreams(Predef.Map(topic -> consumers),
> new
> > StringDecoder)
> > >
> > >
> > >
> > >     for ((topic, streamList) <- topicMessageStreams) {
> > >
> > >
> > >       for (stream <- streamList) {
> > >
> > >
> > >         val processor = new StreamProcessor(stream)
> > >
> > >
> > >
> > >         new Thread(processor).start();
> > >
> > >
> > >       }
> > >
> > >
> > >     }
> > >
> > >
> > >
> > > and the StreamProcessor just iterate each streams
> > >
> > >
> > >   val message = iterator.next.message//chinese characters in message
> > turns into ?????
> > >
> > >
> > >
> > > Anyone any help?
> > >
> > >
> > > --
> > > Best Regards
> > >
> > > ----------------------
> > > 刘明敏 | mmLiu
> > >
> > >
> >
> >
> > --
> > Best Regards
> >
> > ----------------------
> > 刘明敏 | mmLiu
> >
>

Re: encoding issue in kafka

Posted by Patricio Echagüe <pa...@gmail.com>.
I ran into the same issue and setting -Dfile.encoding=UTF-8 in the startup
script fixed it.

On Wed, Jun 6, 2012 at 12:18 AM, 刘明敏 <di...@gmail.com> wrote:

> sorry for the late reply, it is my fault
>
> after set -Dfile.encoding=UTF-8  when start up producer
>
> problem solved.
>
> On Sat, Jun 2, 2012 at 6:07 PM, 刘明敏 <di...@gmail.com> wrote:
>
> > we encountered an encoding issue when dealing with Chinese character
> >
> > the producer send characters in right encode(UTF-8),while after the
> > consumer get it ,it all turns into question marks:????
> >
> > when start up producer,kafka broker server and consumer, we tried
> > specified -Dfile.encoding=UTF-8,but it doesn't work
> >
> >
> > In producer,we use StringEncoder,below is the snippet of producer:
> >
> >
> >
> >
> >   val props = new Properties();
> >
> >
> >
> >   ...
> >
> >   props.put("serializer.class", "kafka.serializer.StringEncoder");
> >
> >
> >   props.put("compression.codec", "1") //gzip
> >
> >
> >
> >   val producerConfig = new ProducerConfig(props);
> >
> >
> >   val producer = new Producer[String, String](producerConfig);
> >
> >
> >     val data = new ProducerData[String, String](topic, partitionKey,
> List("string_to_send_to_borker"));
> >
> >
> >
> >   producer.send(data);
> >
> >
> >
> > and consumer:
> >
> >
> >
> >
> >     val topicMessageStreams =
> consumerConnector.createMessageStreams(Predef.Map(topic -> consumers), new
> StringDecoder)
> >
> >
> >
> >     for ((topic, streamList) <- topicMessageStreams) {
> >
> >
> >       for (stream <- streamList) {
> >
> >
> >         val processor = new StreamProcessor(stream)
> >
> >
> >
> >         new Thread(processor).start();
> >
> >
> >       }
> >
> >
> >     }
> >
> >
> >
> > and the StreamProcessor just iterate each streams
> >
> >
> >   val message = iterator.next.message//chinese characters in message
> turns into ?????
> >
> >
> >
> > Anyone any help?
> >
> >
> > --
> > Best Regards
> >
> > ----------------------
> > 刘明敏 | mmLiu
> >
> >
>
>
> --
> Best Regards
>
> ----------------------
> 刘明敏 | mmLiu
>

Re: encoding issue in kafka

Posted by 刘明敏 <di...@gmail.com>.
sorry for the late reply, it is my fault

after set -Dfile.encoding=UTF-8  when start up producer

problem solved.

On Sat, Jun 2, 2012 at 6:07 PM, 刘明敏 <di...@gmail.com> wrote:

> we encountered an encoding issue when dealing with Chinese character
>
> the producer send characters in right encode(UTF-8),while after the
> consumer get it ,it all turns into question marks:????
>
> when start up producer,kafka broker server and consumer, we tried
> specified -Dfile.encoding=UTF-8,but it doesn't work
>
>
> In producer,we use StringEncoder,below is the snippet of producer:
>
>
>
>
>   val props = new Properties();
>
>
>
>   ...
>
>   props.put("serializer.class", "kafka.serializer.StringEncoder");
>
>
>   props.put("compression.codec", "1") //gzip
>
>
>
>   val producerConfig = new ProducerConfig(props);
>
>
>   val producer = new Producer[String, String](producerConfig);
>
>
>     val data = new ProducerData[String, String](topic, partitionKey, List("string_to_send_to_borker"));
>
>
>
>   producer.send(data);
>
>
>
> and consumer:
>
>
>
>
>     val topicMessageStreams = consumerConnector.createMessageStreams(Predef.Map(topic -> consumers), new StringDecoder)
>
>
>
>     for ((topic, streamList) <- topicMessageStreams) {
>
>
>       for (stream <- streamList) {
>
>
>         val processor = new StreamProcessor(stream)
>
>
>
>         new Thread(processor).start();
>
>
>       }
>
>
>     }
>
>
>
> and the StreamProcessor just iterate each streams
>
>
>   val message = iterator.next.message//chinese characters in message turns into ?????
>
>
>
> Anyone any help?
>
>
> --
> Best Regards
>
> ----------------------
> 刘明敏 | mmLiu
>
>


-- 
Best Regards

----------------------
刘明敏 | mmLiu