Posted to dev@parquet.apache.org by "lizhenmxcz@163.com" <li...@163.com> on 2015/10/27 07:29:58 UTC
how to convert text to parquet in flume serialization
hi all,
i want to convert the flume sink to the parquet format in the serialization, but the parquet writer constructor need a path parameter, while the flume serialization just provide a outputstream interface. i don't how to solve it. who can give me a sample ,thanks。
lizhenmxcz@163.com
Re: Re: how to convert text to parquet in flume serialization
Posted by "lizhenmxcz@163.com" <li...@163.com>.
Thanks, Ryan, I will do as you say.
lizhenmxcz@163.com
From: Ryan Blue
Date: 2015-10-28 00:07
To: dev
Subject: Re: how to convert text parquet in flume serialization
Re: how to convert text to parquet in flume serialization
Posted by Ryan Blue <bl...@cloudera.com>.
I wouldn't recommend writing directly from Flume to Parquet. Parquet
can't guarantee that data is on disk until a file is closed, so you end
up with long-running transactions that back up into your file channel.
Plus, if you are writing to a partitioned dataset you end up with
several open files and huge memory consumption. I recommend first
writing to Avro and then using a batch job to convert into Parquet.
If you really need to write directly to Parquet, take a look at the Kite
DatasetSink instead of using the HDFS sink. That allows you to write
directly to Parquet.
rb
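
To make the two options above concrete, here is a sketch of what the Flume agent configuration might look like. This is an illustration, not a tested config: the agent, sink, and channel names (a1, k1, k2, c1) and the HDFS paths are placeholders, and the exact property names should be checked against the Flume user guide for your version.

```properties
# Option 1 (recommended above): HDFS sink writing Avro container files,
# to be converted to Parquet later by a batch job.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# built-in serializer that writes Flume events as Avro
a1.sinks.k1.serializer = avro_event

# Option 2: Kite DatasetSink writing directly to a dataset
# (the dataset itself can be configured with Parquet as its format).
a1.sinks.k2.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k2.channel = c1
a1.sinks.k2.kite.dataset.uri = dataset:hdfs://namenode/datasets/events
```

With option 1, the follow-up Avro-to-Parquet conversion can be done by any batch tool that reads Avro and writes Parquet (for example a MapReduce or Spark job using parquet-avro's AvroParquetWriter).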
--
Ryan Blue
Software Engineer
Cloudera, Inc.