Posted to dev@flume.apache.org by "Alberto Sarubbi (JIRA)" <ji...@apache.org> on 2016/08/05 19:14:20 UTC

[jira] [Commented] (FLUME-2967) Corrupted gzip files generated when writing to S3

    [ https://issues.apache.org/jira/browse/FLUME-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409921#comment-15409921 ] 

Alberto Sarubbi commented on FLUME-2967:
----------------------------------------

h4. UPDATE:
*Got it working.*
The solution was to include the Hadoop native libraries in my Flume installation, placing them in the _plugins.d/hdfs/native_ directory.
Apparently the built-in Java compression, which is used as a fallback when the native libraries cannot be loaded, does not work correctly here and produces broken files.

The key log message that indicates this condition is:
{noformat}
2016-08-05 18:10:19,038 (hdfs-khdfs-call-runner-0) 
[WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)] 
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{noformat}

After copying the native libraries, this message is displayed instead:
{noformat}
2016-08-05 18:56:10,509 (hdfs-khdfs-call-runner-0) 
[INFO - org.apache.hadoop.io.compress.zlib.ZlibFactory.<clinit>(ZlibFactory.java:49)] 
Successfully loaded & initialized native-zlib library
{noformat}
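
For reference, a minimal sketch of the layout I used (the source path is an assumption; use a native build matching the Hadoop jar versions and platform in use):
{noformat}
# FLUME_HOME is assumed to point at the Flume installation;
# the native libraries should come from a Hadoop 2.5.2 build for this platform
mkdir -p $FLUME_HOME/plugins.d/hdfs/native
cp /path/to/hadoop-2.5.2/lib/native/libhadoop.so* $FLUME_HOME/plugins.d/hdfs/native/
# restart the agent and look for the "Successfully loaded & initialized
# native-zlib library" message shown above
{noformat}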

> Corrupted gzip files generated when writing to S3
> --------------------------------------------------
>
>                 Key: FLUME-2967
>                 URL: https://issues.apache.org/jira/browse/FLUME-2967
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
> Amazon Linux AMI release 2016.03
> 4.1.17-22.30.amzn1.x86_64
>            Reporter: Alberto Sarubbi
>         Attachments: useractivity.1470406765436.json.gz
>
>
> A Flume process configured with the following parameters writes corrupt gzip files to AWS S3.
> h4. Configuration
> {noformat}
> #### SINKS ####
> #sink to write to S3
> a1.sinks.khdfs.type = hdfs
> a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
> a1.sinks.khdfs.hdfs.fileType = CompressedStream
> a1.sinks.khdfs.hdfs.codeC = gzip
> a1.sinks.khdfs.hdfs.filePrefix = useractivity
> a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
> a1.sinks.khdfs.hdfs.writeFormat = Writable
> a1.sinks.khdfs.hdfs.rollCount = 100
> a1.sinks.khdfs.hdfs.rollSize = 0
> a1.sinks.khdfs.hdfs.callTimeout = 120000
> a1.sinks.khdfs.hdfs.batchSize = 1000
> a1.sinks.khdfs.hdfs.threadsPoolSize = 40
> a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
> a1.sinks.khdfs.channel = chdfs
> {noformat}
> The input is a simple JSON structure:
> {code:javascript}
> {
>   "origin": "Mi Tigo App sv",
>   "date": "2016-08-05T14:26:10.859Z",
>   "country": "SV",
>   "action": "MI-TIGO-APP Header Enrichment",
>   "msisdn": "76821107",
>   "ip": "181.189.178.89",
>   "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
>   "data": {
>     "variables": "{\"!msisdn\":\"76821107\"}"
>   },
>   "event_id": "mta_login"
> }
> {code}
> I use a combination of the HDFS sink and the following libraries in the plugins.d/hdfs/libext folder:
> {noformat}
>   hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
>   hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
>   hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
>   hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
>   hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
>   hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
> {noformat}
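> As a sketch, these resolve to jars in the plugin directory roughly like this (jar names follow the usual artifact-version.jar convention; the exact set of transitive dependencies may differ):
> {noformat}
> plugins.d/hdfs/libext/
>   aws-java-sdk-s3-1.10.72.jar
>   hadoop-common-2.5.2.jar
>   hadoop-hdfs-2.5.2.jar
>   hadoop-auth-2.5.2.jar
>   ...
>   jets3t-0.9.4.jar
>   httpclient-4.5.2.jar
>   httpcore-4.4.5.jar
> {noformat}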
> I expect a gzip-compressed file containing 100 events to end up on S3, but the generated file is damaged:
> * the compressed file is larger than the content it contains
> * most tools fail to decompress the file, reporting that it is damaged
> * gzip -d decompresses it, though not without complaining about extra trailing garbage:
> {noformat}
> gzip -d useractivity.1470407170478.json.gz 
> gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored
> {noformat}
> * last but not least, the file resulting from the forced decompression contains only one or two lines, where 100 are expected (a quick check follows)
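> A quick way to check how much of the file is actually recoverable (a sketch; the file name is just the example from above):
> {noformat}
> gzip -t useractivity.1470407170478.json.gz        # integrity test, also reports the trailing garbage
> zcat useractivity.1470407170478.json.gz | wc -l   # counts surviving events: 1 or 2 instead of 100
> {noformat}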
> h4. We tried (to no avail):
> * both the Writable and Text write formats
> * all options for controlling the file content by rolling: time, event count, size
> * all combinations of recipes for writing to S3, including more than one set of libraries
> * both URI schemes (s3n, s3a)
> * not compressing, which generates the expected JSON files just fine (a sketch of that variant follows this list)
> * vanilla Flume libraries
> * heavily replacing the Flume libraries with newer or different versions (just in case)
> * reading all available documentation
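> For comparison, the uncompressed variant that works differs from the configuration above only in these lines (a sketch; the rest of the sink configuration is unchanged):
> {noformat}
> a1.sinks.khdfs.hdfs.fileType = DataStream
> a1.sinks.khdfs.hdfs.fileSuffix = .json
> # hdfs.codeC is not set
> {noformat}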
> h4. We haven't tried:
> * installing Hadoop and referencing its libraries on the classpath (we want to avoid this, since we are not using Hadoop on the Flume nodes)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)