Posted to users@kafka.apache.org by Kidong Lee <my...@gmail.com> on 2016/08/01 08:55:57 UTC

Kafka ETL for Parquet

Hi,

I have written a simple Kafka ETL tool which consumes Avro-encoded data from
Kafka and saves it as Parquet files on HDFS:
https://github.com/mykidong/kafka-etl-consumer

It is implemented with the Kafka Consumer API and the Parquet Writer API.
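
The core of it is just a poll loop feeding a Parquet writer. Conceptually it
looks something like the following simplified sketch (not the actual project
code; the topic, schema file, and output path are placeholders, and the real
implementation also handles file rolling and clean shutdown):

import java.util.Collections;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class KafkaToParquetSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "parquet-etl");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Avro schema loaded from the classpath (placeholder file name).
        Schema schema = new Schema.Parser().parse(
                KafkaToParquetSketch.class.getResourceAsStream("/event.avsc"));
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
             ParquetWriter<GenericRecord> writer = AvroParquetWriter
                     .<GenericRecord>builder(new Path("hdfs:///data/events/part-00000.parquet"))
                     .withSchema(schema)
                     .build()) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) { // the real code rolls files and exits cleanly
                ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Decode the raw Avro bytes and append them to the Parquet file.
                    GenericRecord datum = reader.read(null,
                            DecoderFactory.get().binaryDecoder(record.value(), null));
                    writer.write(datum);
                }
                consumer.commitSync();
            }
        }
    }
}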

- Kidong Lee.

Re: Kafka ETL for Parquet

Posted by Shikhar Bhushan <sh...@confluent.io>.
Hi Kidong,

What specific issues did you run into when trying this out?

I think the basic idea would be to depend on the avro-serializer package and
implement your custom Converter along the lines of the AvroConverter. You
only need the deserialization bits (`toConnectData`) and can stub out
`fromConnectData`, since the HDFS connector, being a 'sink connector', will
never exercise the latter. The avro-serializer package does pull in a
dependency on kafka-schema-registry-client since it uses the
`SchemaRegistryClient` interface. You can supply your own implementation
here; not all of its methods are needed for the deserialization bits, so it
need not be complete.
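
Alternatively, if your messages are plain Avro without the SR wire framing
(magic byte + schema id), you can skip the avro-serializer path entirely and
decode with vanilla Avro, using Confluent's `AvroData` only for the Connect
translation. A rough, untested sketch; the class name and config key are
made up:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

import io.confluent.connect.avro.AvroData;

public class ClasspathAvroConverter implements Converter {

    private Schema avroSchema;
    private GenericDatumReader<GenericRecord> reader;
    private AvroData avroData;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Made-up config key: location of the .avsc file on the classpath.
        String schemaPath = (String) configs.get("avro.schema.classpath");
        try (InputStream in = getClass().getClassLoader().getResourceAsStream(schemaPath)) {
            if (in == null) {
                throw new DataException("Schema not found on classpath: " + schemaPath);
            }
            avroSchema = new Schema.Parser().parse(in);
        } catch (IOException e) {
            throw new DataException("Failed to load Avro schema: " + schemaPath, e);
        }
        reader = new GenericDatumReader<>(avroSchema);
        avroData = new AvroData(100); // schema cache size
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        try {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value, null);
            GenericRecord record = reader.read(null, decoder);
            // AvroData translates the Avro record into Connect's data model.
            return avroData.toConnectData(avroSchema, record);
        } catch (IOException e) {
            throw new DataException("Failed to deserialize Avro from topic " + topic, e);
        }
    }

    @Override
    public byte[] fromConnectData(String topic,
            org.apache.kafka.connect.data.Schema schema, Object value) {
        // A sink connector never calls this, so it can stay a stub.
        throw new UnsupportedOperationException("This converter is deserialization-only");
    }
}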

Best,

Shikhar

Re: Kafka ETL for Parquet

Posted by Kidong Lee <my...@gmail.com>.
Thanks for your interest Shikhar,

Actually, I asked about this and discussed it in this thread:
https://mail-archives.apache.org/mod_mbox/kafka-users/201607.mbox/%3CCAE1jLMOnYb2ScNweoBdsXRHOxjYLe=Ha-6igLDNTL95aBUyXBg@mail.gmail.com%3E
The problem for me was that it was not easy to understand Connect's internal
data structures. I tried writing an AvroConverter as you mentioned, but I
could not get it to run correctly, and I could not find a way to avoid the
Schema Registry when writing an AvroConverter.

Could you give me a concrete implementation of an AvroConverter that
supports, for instance, a classpath-based Avro schema registry?

- Kidong.

Re: Kafka ETL for Parquet

Posted by Shikhar Bhushan <sh...@confluent.io>.
Er, I mislinked the HDFS connector :)
https://github.com/confluentinc/kafka-connect-hdfs

Re: Kafka ETL for Parquet

Posted by Shikhar Bhushan <sh...@confluent.io>.
Hi Kidong,

That's pretty cool! I'm curious what this offers over the Confluent HDFS
connector <https://github.com/mykidong/kafka-etl-consumer>, though.

The README mentions not depending on the Schema Registry, and that the
schema can be retrieved via the classpath or Consul. This functionality
should actually be pluggable in Connect by implementing a custom
`Converter`; e.g. the SR package comes with an `AvroConverter` which acts as
the glue. Converter classes can be specified with the `key.converter` and
`value.converter` configs.
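
For instance, a sink worker could be pointed at a custom converter via its
worker properties (the value converter class here is hypothetical):

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.example.ClasspathAvroConverter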

Best,

Shikhar