Posted to dev@kafka.apache.org by Clayton Wohl <cl...@gmail.com> on 2017/05/07 09:36:37 UTC

Kafka Connect Parquet Support?

With the Kafka Connect S3 sink, I can choose Avro or JSON output format. Is
there any chance that Parquet will be supported?

For record-at-a-time processing, Parquet isn't a good fit. But for
reading and writing batches of records, which is what the Kafka Connect
sink writes, Parquet is generally better than Avro.

Would it be wise to attempt writing support for this, or not?
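
A minimal sketch of the sink configuration in question, assuming the
Confluent kafka-connect-s3 connector (the topic and bucket names are made
up, and the Parquet format class is hypothetical; no such class ships
today):

    name=s3-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    topics=my-topic
    s3.bucket.name=my-bucket
    storage.class=io.confluent.connect.s3.storage.S3Storage
    flush.size=10000
    # Shipping choices:
    format.class=io.confluent.connect.s3.format.avro.AvroFormat
    # format.class=io.confluent.connect.s3.format.json.JsonFormat
    # Hypothetical, what this thread asks for:
    # format.class=io.confluent.connect.s3.format.parquet.ParquetFormat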

Re: Kafka Connect Parquet Support?

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Theoretically yes, we would want to support this in Confluent's S3
connector. One of the stumbling blocks is that the Parquet code is
apparently somewhat tied to HDFS at the moment, which causes problems when
you're not going through HDFS's S3 connectivity. See, e.g.,
https://github.com/confluentinc/kafka-connect-storage-cloud/issues/26 on
the cloud storage/S3 connector.
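
To make the HDFS coupling concrete, here is a sketch of writing one batch
of Avro records to a single Parquet file with parquet-avro (the bucket and
path are made up). The point is that the writer API is built around
org.apache.hadoop.fs.Path and a Hadoop Configuration rather than a plain
OutputStream, which is the tie to Hadoop described above:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetBatchWriter {
        public static void writeBatch(Schema schema, Iterable<GenericRecord> batch)
                throws java.io.IOException {
            // Even an S3 target is addressed through a Hadoop FileSystem
            // implementation (e.g. s3a from hadoop-aws), via a Hadoop Path.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("s3a://my-bucket/my-topic/part-00000.parquet"))
                    .withSchema(schema)
                    .withConf(new Configuration())  // Hadoop config, even off-HDFS
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                for (GenericRecord record : batch) {
                    writer.write(record);
                }
            }
        }
    }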

Re: Kafka Connect Parquet Support?

Posted by Colin McCabe <cm...@apache.org>.
Hi Clayton,

It seems like an interesting improvement.  Given that Parquet is
columnar, you would expect some space savings.  I guess the big question
is, would each batch of records become a single Parquet file?  And how
does this integrate with the existing logic, which might assume that
each record can be serialized on its own?

best,
Colin
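
For reference on the per-record point: the storage connectors feed records
one at a time into a RecordWriter, roughly the shape below (paraphrased
from memory of kafka-connect-storage-common, not a verbatim copy). A
Parquet implementation would buffer rows into an open file and only emit a
complete, readable Parquet file when the file is committed and rotated:

    import org.apache.kafka.connect.sink.SinkRecord;

    // Approximate shape of io.confluent.connect.storage.format.RecordWriter:
    public interface RecordWriter {
        void write(SinkRecord record);  // invoked once per record
        void commit();                  // finalize/rotate the current file
        void close();                   // release resources
    }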

