Posted to user@flume.apache.org by Kristopher Kalish <kr...@kalish.net> on 2014/08/11 18:25:27 UTC

Corrupt gzip files when using the hdfs sink to write to s3n

Hey everyone,

I apologize if this has been asked before, but I was unable to find a
similar problem in the archives. I have successfully configured Flume to
write to s3n. However, if I turn on gzip compression, the files that end
up in s3n are malformed gzip files: their "packed size" is larger than
their extracted size.
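For anyone who wants to reproduce the check I ran, here is a small sketch (plain Python, nothing Flume-specific) that distinguishes a well-formed gzip stream from one that is missing its end-of-stream marker, which is what the files pulled down from s3n look like:

```python
import gzip
import io

def is_valid_gzip(data: bytes) -> bool:
    """Return True only if the bytes decompress cleanly as gzip."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            f.read()
        return True
    except (OSError, EOFError):
        # BadGzipFile is an OSError subclass; a missing trailer raises EOFError
        return False

good = gzip.compress(b"event data\n" * 100)
truncated = good[:-10]  # simulate a stream that was never properly finished

print(is_valid_gzip(good))       # True
print(is_valid_gzip(truncated))  # False
```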

I have a hypothesis that this is due to the s3n "driver" not implementing
isFileClosed(). I imagine the HDFS sink is not closing and reopening the
compressed stream somewhere under the hood.
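To illustrate why that would produce exactly these symptoms, here is a sketch in plain Python (not Flume's actual code path) showing that a gzip stream which is flushed but never closed lacks the trailing CRC32/length marker and reads as corrupt:

```python
import gzip
import io

buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(b'{"event-type": "click"}\n' * 50)
gz.flush()                  # compressed bytes reach the buffer, but the
unclosed = buf.getvalue()   # gzip trailer (CRC32 + size) is never written

try:
    gzip.decompress(unclosed)
    readable = True
except EOFError:            # "Compressed file ended before the
    readable = False        #  end-of-stream marker was reached"

print("readable without close():", readable)
```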

This seems like it would be a common configuration scenario, though, so I'm
wondering if anyone has some insight. Not sure if it matters, but I'm
running on Windows Server 2008. Here is a copy of the agent configuration
I'm using:


# Agent
agent.sources =  http
agent.channels = s3
agent.sinks = s3

# source
agent.sources.http.type = http
agent.sources.http.bind = localhost
agent.sources.http.port = 6162
agent.sources.http.channels = s3

# route events base on event type header
agent.sources.http.selector.type = multiplexing
agent.sources.http.selector.header = event-type
#...
agent.sources.http.selector.default = s3

#  s3
###########################################################
# channel
agent.channels.s3.type = file
agent.channels.s3.checkpointDir = D:\\flume-data\\flume-file-channel\\s3\\checkpoint
agent.channels.s3.dataDirs = D:\\flume-data\\flume-file-channel\\s3\\data
agent.channels.s3.maxFileSize = 10485760

# sink
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = s3
agent.sinks.s3.hdfs.path = s3n://XXXXX:XXXX@mybucket/%{event-type}/y=%Y/m=%m/d=%d/h=%H
agent.sinks.s3.hdfs.fileType = DataStream
agent.sinks.s3.hdfs.writeFormat = Text
agent.sinks.s3.hdfs.batchSize = 10000
agent.sinks.s3.hdfs.rollCount = 10000
agent.sinks.s3.hdfs.rollInterval = 300
agent.sinks.s3.hdfs.rollSize = 0
agent.sinks.s3.hdfs.filePrefix = flume.%{host}
agent.sinks.s3.hdfs.fileSuffix = .txt
agent.sinks.s3.hdfs.timeZone = UTC
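
For reference, the config above is the uncompressed variant. When I turn on
gzip I use the documented HDFS sink compression properties, roughly like
this (with the fileType and suffix changed to match):

```
agent.sinks.s3.hdfs.fileType = CompressedStream
agent.sinks.s3.hdfs.codeC = gzip
agent.sinks.s3.hdfs.fileSuffix = .gz
```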