Posted to user@flume.apache.org by Andrei Stryia <An...@epam.com> on 2015/08/20 11:38:50 UTC

Generate valid Avro file in S3

Hi there,

Our system generates a lot of small files in Avro format, all with the same schema, and sends them to Flume via Thrift RPC.
Our Flume agent has the following configuration:

agent.channels=ch1
agent.sources=thrift-source1
agent.sinks=s3-sink1
agent.channels.ch1.type=file
agent.channels.ch1.checkpointDir=/flume/ch1/checkpoint
agent.channels.ch1.dataDirs=/flume/ch1/data
agent.sources.thrift-source1.channels=ch1
agent.sources.thrift-source1.type=thrift
agent.sources.thrift-source1.bind=0.0.0.0
agent.sources.thrift-source1.threads=5
agent.sources.thrift-source1.port=1026
agent.sinks.s3-sink1.channel=ch1
agent.sinks.s3-sink1.type=hdfs
agent.sinks.s3-sink1.hdfs.path=s3n://bucket/path/
agent.sinks.s3-sink1.hdfs.filePrefix=documents
agent.sinks.s3-sink1.hdfs.fileSuffix=.avro
agent.sinks.s3-sink1.hdfs.rollInterval=0
agent.sinks.s3-sink1.hdfs.rollSize=20971520
agent.sinks.s3-sink1.hdfs.rollCount=0
agent.sinks.s3-sink1.hdfs.batchSize=10
agent.sinks.s3-sink1.hdfs.fileType=DataStream
agent.sinks.s3-sink1.hdfs.useLocalTimeStamp=true

Currently Flume just concatenates all the Avro files into a single file, so I end up with one big file in which the schema and other Avro-specific metadata are written multiple times.
How can I configure Flume to generate a valid Avro container file, in which the schema is written once and which contains the Avro datums (without the per-file metadata) from all the small files? The schema is the same for all files.
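
One direction I am considering, as an untested sketch: the HDFS sink module ships a serializer, org.apache.flume.sink.hdfs.AvroEventSerializer, that writes a single container file and takes the schema from an event header (flume.avro.schema.url or flume.avro.schema.literal). As far as I understand, it expects each event body to be one binary-encoded Avro datum rather than a complete container file, so our sender would have to change as well. The interceptor name i1 and the schema location below are placeholders, not values from our setup.

# Untested sketch; the sink keeps fileType=DataStream as above.
agent.sinks.s3-sink1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
# Attach the schema location to every event with a static interceptor.
# "i1" and the URL are hypothetical placeholders.
agent.sources.thrift-source1.interceptors=i1
agent.sources.thrift-source1.interceptors.i1.type=static
agent.sources.thrift-source1.interceptors.i1.key=flume.avro.schema.url
agent.sources.thrift-source1.interceptors.i1.value=hdfs://namenode/path/to/documents.avsc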

Thanks,
Andrei.