Posted to issues@flume.apache.org by "Tomas Zulberti (Jira)" <ji...@apache.org> on 2020/05/21 21:53:00 UTC

[jira] [Created] (FLUME-3369) Corrupt S3 File

Tomas Zulberti created FLUME-3369:
-------------------------------------

             Summary: Corrupt S3 File
                 Key: FLUME-3369
                 URL: https://issues.apache.org/jira/browse/FLUME-3369
             Project: Flume
          Issue Type: Bug
    Affects Versions: 1.9.0
            Reporter: Tomas Zulberti


We are using Flume to read from Kinesis and upload the files to S3. The problem is that the generated gzip file is corrupt in one of two ways:

- it is an empty file, or
- it is a file that is not a valid gzip file.

I checked FLUME-2967, and we are already using native libraries. The stack trace I have is as follows:

{code}
21 May 2020 01:09:27,342 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:246)  - Creating s3a://mycompany/foobar/year=2020/month=05/day=21/hour=01/172_17_5_220_bids4.1590023192733.gz

21 May 2020 01:09:27,393 INFO  [hdfs-bids4-call-runner-19] (org.apache.flume.sink.hdfs.AbstractHDFSWriter.reflectGetNumCurrentReplicas:190)  - FileSystem's output stream doesn't support getNumCurrentReplicas; --HDFS-826 not available; fsOut=org.apache.hadoop.fs.s3a.S3ABlockOutputStream; err=java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3ABlockOutputStream.getNumCurrentReplicas()

21 May 2020 01:09:27,396 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.getRefIsClosed:197)  - isFileClosed() is not available in the version of the distributed filesystem being used. Flume will not attempt to re-close files if the close fails on the first attempt

21 May 2020 01:09:27,614 WARN  [hdfs-bids4-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter$CloseHandler.close:348)  - Closing file: s3a://mycompany/foobar/year=2020/month=05/day=21/hour=01/172_17_5_220_foobar.1590022801143.gz failed. Will retry again in 180 seconds.

java.io.IOException: Filesystem {bucket=dw.jampp.com, key='foobar/year=2020/month=05/day=21/hour=01/172_17_5_220_foobar.1590022801143.gz'} closed
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.checkOpen(S3ABlockOutputStream.java:224)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:270)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:83)
        at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
        at org.apache.flume.sink.hdfs.HDFSCompressedDataStream.close(HDFSCompressedDataStream.java:149)
        at org.apache.flume.sink.hdfs.BucketWriter$3.call(BucketWriter.java:319)
        at org.apache.flume.sink.hdfs.BucketWriter$3.call(BucketWriter.java:316)
        at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:727)
        at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
        at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:724)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

21 May 2020 01:09:27,656 WARN  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.append:613)  - Caught IOException writing to HDFSWriter (write beyond end of stream). Closing file (s3a://mycompany/foobar/year=2020/month=05/day=21/hour=01/172_17_5_220_foobar.1590023192733.gz) and rethrowing exception.

21 May 2020 01:09:27,658 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink$1.run:393)  - Writer callback called.

21 May 2020 01:09:27,658 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.doClose:438)  - Closing s3a://mycompany/foobar/year=2020/month=05/day=21/hour=01/172_17_5_220_foobar.1590023192733.gz
{code}
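
To tell the two failure modes described above apart (an empty object versus one that is not valid gzip), a local copy of an uploaded file could be checked with something like the following. This is only a minimal sketch and not part of Flume: it assumes the object has already been downloaded from S3 to a local path, and the function name classify_gzip is made up for illustration.

```python
import gzip
import zlib
from pathlib import Path


def classify_gzip(path):
    """Return 'empty', 'valid' or 'corrupt' for a local copy of an uploaded file.

    Hypothetical helper for illustration; it is not part of Flume or Hadoop.
    """
    p = Path(path)
    if p.stat().st_size == 0:
        return "empty"
    try:
        with gzip.open(p, "rb") as fh:
            # Read to EOF in chunks so the gzip trailer (CRC32 and length)
            # is checked as well, not just the header.
            while fh.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch, EOFError a truncated
        # stream, zlib.error damaged deflate data.
        return "corrupt"
    return "valid"
```

For example, classify_gzip("172_17_5_220_foobar.1590022801143.gz") on a downloaded copy would report which of the two cases a given file falls into. A file whose close failed partway through would be expected to show up as "corrupt" because the gzip trailer is missing, but that is an assumption about the failure, not something the log above confirms.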