Posted to dev@parquet.apache.org by "lizhenmxcz@163.com" <li...@163.com> on 2015/10/27 07:29:58 UTC

how to convert text to parquet in flume serialization

hi all,
    I want to convert the Flume sink output to the Parquet format in the serializer, but the Parquet writer constructor needs a path parameter, while the Flume serializer only provides an OutputStream interface. I don't know how to solve this. Can anyone give me a sample? Thanks.


lizhenmxcz@163.com
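
For context, the Path-based constructor the question refers to looks roughly like the minimal sketch below, using the Avro object model. The imports assume a Parquet release after the packages moved from parquet.* to org.apache.parquet.* (around 1.7), and the one-field schema and output path are placeholders for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;

public class PathBasedWriterDemo {
    public static void main(String[] args) throws Exception {
        // A one-field record schema, just for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"LogLine\", \"fields\": ["
            + "{\"name\": \"line\", \"type\": \"string\"}]}");

        // The constructor takes a Path, not an OutputStream: Parquet
        // buffers row groups in memory and writes the file footer only
        // on close(), so the writer needs to own the whole file.
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(
                new Path("/tmp/demo.parquet"), schema);

        GenericRecord record = new GenericData.Record(schema);
        record.put("line", "hello parquet");
        writer.write(record);

        // Readers cannot see the data until the footer is written here.
        writer.close();
    }
}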

Re: Re: how to convert text to parquet in flume serialization

Posted by "lizhenmxcz@163.com" <li...@163.com>.
Thanks, Ryan. I will do as you say.


lizhenmxcz@163.com
 

Re: how to convert text to parquet in flume serialization

Posted by Ryan Blue <bl...@cloudera.com>.
I wouldn't recommend writing directly from Flume to Parquet. Parquet 
can't guarantee that data is on disk until a file is closed, so you end 
up with long-running transactions that back up into your file channel. 
Plus, if you are writing to a partitioned dataset you end up with 
several open files and huge memory consumption. I recommend first 
writing to Avro and then using a batch job to convert into Parquet.
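
A minimal sketch of such a conversion job, assuming the input is an Avro data file (for example, one written by the Flume HDFS sink with an Avro serializer) and using the Path-based AvroParquetWriter; the file locations come from the command line, and a production job would usually run under MapReduce or Spark rather than a single JVM.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;

public class AvroToParquet {
    public static void main(String[] args) throws Exception {
        File avroFile = new File(args[0]);      // Avro file to convert
        Path parquetFile = new Path(args[1]);   // Parquet file to create

        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            avroFile, new GenericDatumReader<GenericRecord>());

        // Reuse the writer schema embedded in the Avro data file.
        Schema schema = reader.getSchema();

        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(parquetFile, schema);
        for (GenericRecord record : reader) {
            writer.write(record);
        }

        writer.close();   // the footer goes out here; only now is the file readable
        reader.close();
    }
}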

If you really need to write directly to Parquet, take a look at the Kite 
DatasetSink instead of using the HDFS sink. That allows you to write 
directly to Parquet.
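
For reference, wiring the Kite sink into an agent might look roughly like the sketch below. The agent name, channel name, and dataset URI are placeholders, and the property names should be checked against the Flume user guide for your release (they assume the Flume 1.6-era Kite DatasetSink).

# Hypothetical agent "agent1" with an existing channel "ch1"; the URI must
# point at a Kite dataset that was created with a Parquet format descriptor.
agent1.sinks = kite-sink
agent1.sinks.kite-sink.type = org.apache.flume.sink.kite.DatasetSink
agent1.sinks.kite-sink.channel = ch1
agent1.sinks.kite-sink.kite.dataset.uri = dataset:hdfs:/data/events
agent1.sinks.kite-sink.kite.batchSize = 1000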

rb

On 10/26/2015 11:29 PM, lizhenmxcz@163.com wrote:
>
> hi all,
>      I want to convert the Flume sink output to the Parquet format in the serializer, but the Parquet writer constructor needs a path parameter, while the Flume serializer only provides an OutputStream interface. I don't know how to solve this. Can anyone give me a sample? Thanks.
>
>
> lizhenmxcz@163.com
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.