Posted to user@flink.apache.org by Fabian Hueske <fh...@gmail.com> on 2018/10/01 08:52:18 UTC

Re: Streaming to Parquet Files in HDFS

Hi Bill,

Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the
previously mentioned StreamingFileSink [1], [2].
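A minimal sketch of such a job (assuming an Avro-mappable class LogEvent and a
DataStream[LogEvent] named events; checkpointing must be enabled, since
bulk-encoded sinks finalize part files on checkpoints) would look like:

	import org.apache.flink.core.fs.Path
	import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
	import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

	// Each Avro record becomes a row of a Parquet file under the given base path.
	val sink: StreamingFileSink[LogEvent] = StreamingFileSink
		.forBulkFormat(new Path("hdfs:///data/logs"),
			ParquetAvroWriters.forReflectRecord(classOf[LogEvent]))
		.build()

	events.addSink(sink)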

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9753
[2] https://issues.apache.org/jira/browse/FLINK-9750

On Fri, Sep 28, 2018 at 11:36 PM hao gao <ha...@gmail.com> wrote:

> Hi Bill,
>
> I wrote those two Medium posts you mentioned above, but clearly the
> techlab one is much better.
> I would suggest just "close the file when checkpointing", which is the
> easiest way. If you use BucketingSink, you can modify the code to make it
> work: just replace the code from line 691 to 693 with
> closeCurrentPartFile()
>
> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
> This should guarantee exactly-once. You may have some files with an
> underscore prefix when the Flink job fails, but those files are usually
> ignored by query engines/readers, for example Presto.
>
> If you use 1.6 and later, I think the issue is already addressed
> https://issues.apache.org/jira/browse/FLINK-9750
>
> Thanks
> Hao
>
> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <ws...@apache.org> wrote:
>
>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>> files on HDFS via Flink. I'm able to read, parse, and construct objects for
>> my messages in Flink; however, writing to Parquet is tripping me up. I do
>> *not* need to have this be real-time; a delay of a few minutes, even up to
>> an hour, is fine.
>>
>> I've found the following articles talking about this being very difficult:
>> *
>> https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>> *
>> https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>> *
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>
>> All of these posts speak of troubles with the checkpointing mechanisms
>> and Parquet's need to perform batch writes. I'm not experienced enough with
>> Flink's checkpointing or Parquet's file format to completely understand
>> the issue. So my questions are as follows:
>>
>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>> possible in a way that _might_ cause duplicates during an error?
>>
>> 2) Is there another/better format to use other than Parquet that offers
>> compression and the ability to be queried by something like Drill or Impala?
>>
>> 3) Any further recommendations for solving the overall problem: ingesting
>> syslogs and writing them to a file(s) that is searchable by an SQL(-like)
>> framework?
>>
>> Thanks!
>>
>> Bill-
>>
>
>
> --
> Thanks
>  - Hao
>

Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

Thanks for the info. That error was caused by building your code against an
out-of-date baseline. After rebasing my branch, the issue is gone.
I've been testing, and so far I have the following questions/issues:

1. I'm not able to write to S3 with the following URI format: *s3*://<path>,
and had to use *s3a*://<path>. Is this behaviour expected? (I am running
Flink on AWS EMR, and I thought that EMR provides a wrapper for HDFS over S3
with something called EMRFS).

2. Occasionally/randomly I get the message below ( parquet_error1.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/parquet_error1.log>
). I'm using the ParquetAvroWriters.forReflectRecord() method to write Scala
case classes. Re-running the job doesn't hit the error at the same data
location, so I don't think there's an issue with the data.
 *java.lang.ArrayIndexOutOfBoundsException: <some random number>* at
org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.fallBackDictionaryEncodedData

3. Sometimes I get this error message when I use a parallelism of 8 for the
sink ( parquet_error2.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/parquet_error2.log>
).
Reducing it to 2 solves the issue, but is it possible to increase the pool
size? I could not find any place where I can change the
fs.s3.maxconnections parameter.
java.io.InterruptedIOException: initiate MultiPartUpload on
Test/output/dt=2018-09-20/part-7-5:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Unable
to execute HTTP request: Timeout waiting for connection from pool

4. Where is the temporary folder in which the Parquet file is stored before
being uploaded to S3?

Thanks a lot for your help.

Best regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi,

Yes, please enable DEBUG for org.apache.flink.streaming to see all the
logs, including those from the StreamTask.
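
For example, with a default log4j setup this corresponds to a line like the
following in conf/log4j.properties (adjust accordingly if you use logback):

	log4j.logger.org.apache.flink.streaming=DEBUG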

A checkpoint is "valid" as soon as it gets acknowledged.
As the documentation says, the job will restart from
"the last **successful** checkpoint", which is the most
recent acknowledged one.

Cheers,
Kostas  

> On Oct 7, 2018, at 1:03 PM, Averell <lv...@gmail.com> wrote:
> 
> Hi Kostas,
> 
> Yes, I set the level to DEBUG, but only for
> org.apache.flink.streaming.api.functions.sink.filesystem.bucket.
> I will try to enable it for org.apache.flink.streaming.
> One possible issue with my build is that I had not used the
> latest master branch when merging with your PR, so I might have missed some
> other important PRs.
> 
> BTW, regarding your comment "Checkpoints and acknowledgments are not
> necessarily aligned": if, for some reason, the job crashes
> right after the checkpoint and before the acknowledgements, does that mean
> the last checkpoint is not valid?
> 
> Thanks and best regards,
> Averell
> 
> 
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

Yes, I set the level to DEBUG, but only for
org.apache.flink.streaming.api.functions.sink.filesystem.bucket.
I will try to enable it for org.apache.flink.streaming.
One possible issue with my build is that I had not used the
latest master branch when merging with your PR, so I might have missed some
other important PRs.

BTW, regarding your comment "Checkpoints and acknowledgments are not
necessarily aligned": if, for some reason, the job crashes
right after the checkpoint and before the acknowledgements, does that mean
the last checkpoint is not valid?

Thanks and best regards,
Averell





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi,

I just saw that you have already set the level to DEBUG.

Are these all the DEBUG logs of the TM when running on YARN?

Also, did you try to wait a bit longer to see if the acknowledgements of the
checkpoints arrive later? Checkpoints and acknowledgments are not necessarily aligned.

Kostas

> On Oct 7, 2018, at 12:37 PM, Kostas Kloudas <k....@data-artisans.com> wrote:
> 
> Hi Averell,
> 
> Could you set your logging to DEBUG?
> This may shed some light on what is happening as it will contain more logs.
> 
> Kostas
> 
>> On Oct 7, 2018, at 11:03 AM, Averell <lv...@gmail.com> wrote:
>> 
>> Hi Kostas,
>> 
>> I'm using a build with your PR. However, it seems the issue is not with S3,
>> as when I tried to write to the local file system (file:///, not HDFS), I also
>> got the same problem - only the first part was published. All remaining parts
>> stayed in-progress and had names prefixed with "."
>> 
>> From Flink GUI, all checkpoints were shown as completed successfully. 
>> 
>> How could I debug further?
>> 
>> Thanks a lot for your help.
>> Regards,
>> Averell
>> 
>> 
>> 
>> --
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
> 


Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Averell,

Could you set your logging to DEBUG?
This may shed some light on what is happening as it will contain more logs.

Kostas

> On Oct 7, 2018, at 11:03 AM, Averell <lv...@gmail.com> wrote:
> 
> Hi Kostas,
> 
> I'm using a build with your PR. However, it seems the issue is not with S3,
> as when I tried to write to the local file system (file:///, not HDFS), I also
> got the same problem - only the first part was published. All remaining parts
> stayed in-progress and had names prefixed with "."
> 
> From Flink GUI, all checkpoints were shown as completed successfully. 
> 
> How could I debug further?
> 
> Thanks a lot for your help.
> Regards,
> Averell
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

I'm using a build with your PR. However, it seems the issue is not with S3,
as when I tried to write to the local file system (file:///, not HDFS), I also
got the same problem - only the first part was published. All remaining parts
stayed in-progress and had names prefixed with "."

From Flink GUI, all checkpoints were shown as completed successfully. 

How could I debug further?

Thanks a lot for your help.
Regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Averell,

From the logs, only checkpoint 2 was acknowledged (search for
"received completion notification for checkpoint with id="), and this is
why no more files are finalized. So only checkpoint 2 was successfully
completed.

BTW, are you using the PR you mentioned before, or Flink 1.6?
I am asking because in Flink 1.6 there is no support for S3 in the
streaming file sink.

Cheers,
Kostas

> On Oct 7, 2018, at 2:02 AM, Averell <lv...@gmail.com> wrote:
> 
> received completion notification for checkpoint wit


Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

Please ignore my previous email about the security issue. It seems
I had mixed versions of shaded and unshaded jars.

However, I'm now facing another issue with writing Parquet files: only the
first part is closed. All subsequent parts are kept in the in-progress state
forever. I checkpoint every 3 minutes. Sink parallelism is set to 1
(setting it to 4 or 30 made no difference).
The bucket ID assigner uses the event timestamp.
I only get this issue when running Flink on a YARN cluster, whether writing
to file:/// or to S3. When I run it on my laptop, I get one part for every
single checkpoint.
The TM log says something like "*BucketState ... has pending files for
checkpoints: {2 }*"

Could you please help me with how to debug this further?

Here below is the TM log:

2018-10-06 14:39:01.197 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 due to element
SdcRecord(1537656300000,meter0219838,R1.S1.LT1.P25).
2018-10-06 14:39:01.984 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
opening new part file "part-0-0" for bucket id=dt=2018-09-22.
2018-10-06 14:40:17.855 [Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing for checkpoint with id=2 (max part counter=1).
2018-10-06 14:40:17.855 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 on checkpoint.
2018-10-06 14:40:18.254 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing: BucketState for bucketId=dt=2018-09-22 and
bucketPath=s3a://assn-averell/Test/output/dt=2018-09-22, has pending files
for checkpoints: {2 }
2018-10-06 14:40:44.069 [Async calls on Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
received completion notification for checkpoint with id=2.
2018-10-06 14:40:46.691 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 due to element
SdcRecord(1537656300000,meter0207081,R1.S1.LT1.P25).
2018-10-06 14:40:46.765 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
opening new part file "part-0-1" for bucket id=dt=2018-09-22.
2018-10-06 14:43:17.831 [Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing for checkpoint with id=3 (max part counter=2).
2018-10-06 14:43:17.831 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 on checkpoint.
2018-10-06 14:43:18.401 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing: BucketState for bucketId=dt=2018-09-22 and
bucketPath=s3a://assn-averell/Test/output/dt=2018-09-22, has pending files
for checkpoints: {3 }
2018-10-06 14:45:59.276 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 due to element
SdcRecord(1537657200000,meter0218455,R1.S1.LT1.P10).
2018-10-06 14:45:59.334 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
opening new part file "part-0-2" for bucket id=dt=2018-09-22.
2018-10-06 14:46:17.825 [Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing for checkpoint with id=4 (max part counter=3).
2018-10-06 14:46:17.825 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 on checkpoint.
2018-10-06 14:46:18.228 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing: BucketState for bucketId=dt=2018-09-22 and
bucketPath=s3a://assn-averell/Test/output/dt=2018-09-22, has pending files
for checkpoints: {3 4 }
2018-10-06 14:46:25.041 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 due to element
SdcRecord(1537657200000,meter0209471,R1.S1.LT1.P25).
2018-10-06 14:46:25.186 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
opening new part file "part-0-3" for bucket id=dt=2018-09-22.
2018-10-06 14:49:17.848 [Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing for checkpoint with id=5 (max part counter=4).
2018-10-06 14:49:17.849 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Bucket  - Subtask 0
closing in-progress part file for bucket id=dt=2018-09-22 on checkpoint.
2018-10-06 14:49:18.385 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing: BucketState for bucketId=dt=2018-09-22 and
bucketPath=s3a://assn-averell/Test/output/dt=2018-09-22, has pending files
for checkpoints: {3 4 5 }
2018-10-06 14:52:17.824 [Sink: Unnamed (1/1)] INFO
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing for checkpoint with id=6 (max part counter=4).
2018-10-06 14:52:17.825 [Sink: Unnamed (1/1)] DEBUG
o.a.flink.streaming.api.functions.sink.filesystem.Buckets  - Subtask 0
checkpointing: BucketState for bucketId=dt=2018-09-22 and
bucketPath=s3a://assn-averell/Test/output/dt=2018-09-22, has pending files
for checkpoints: {3 4 5 }

Thanks and best regards,
Averell





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

I tried your PR - trying to write to S3 from Flink running on AWS - and I got
the following error. I copied the three jar files
flink-hadoop-fs-1.7-SNAPSHOT.jar, flink-s3-fs-base-1.7-SNAPSHOT.jar, and
flink-s3-fs-hadoop-1.7-SNAPSHOT.jar to the lib/ directory. Do I need to make
any changes to the Hadoop configuration?

Thanks and best regards,
Averell

java.lang.Exception: unable to establish the security context
	at
org.apache.flink.runtime.security.SecurityUtils.install(SecurityUtils.java:73)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1118)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class
org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback
not org.apache.hadoop.security.GroupMappingServiceProvider
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2503)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:106)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:101)
	at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:448)
	at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:331)
	at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:359)
	at
org.apache.flink.runtime.security.modules.HadoopModule.install(HadoopModule.java:70)
	at
org.apache.flink.runtime.security.SecurityUtils.install(SecurityUtils.java:67)
	... 1 more
Caused by: java.lang.RuntimeException: class
org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback
not org.apache.hadoop.security.GroupMappingServiceProvider
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2497)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
What great news.
Thanks for that, Kostas.

Regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Averell,

There is no such “out-of-the-box” solution, but there is an open PR for adding 
S3 support to the StreamingFileSink [1].

Cheers,
Kostas

[1] https://github.com/apache/flink/pull/6795 <https://github.com/apache/flink/pull/6795>

> On Oct 5, 2018, at 11:14 AM, Averell <lv...@gmail.com> wrote:
> 
> Hi Kostas,
> 
> Thanks for the info.
> Just one more question regarding writing Parquet. I need to write my stream
> as Parquet to S3. As per this ticket
> https://issues.apache.org/jira/browse/FLINK-9752
> <https://issues.apache.org/jira/browse/FLINK-9752>  , it is not yet
> supported. Is there any ready-to-use solution that supports copying/moving
> files from HDFS to S3 (something like a trigger from Flink after it has
> finished writing to HDFS)?
> 
> Thanks and best regards,
> Averell 
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Kostas,

Thanks for the info.
Just one more question regarding writing Parquet. I need to write my stream
as Parquet to S3. As per this ticket
https://issues.apache.org/jira/browse/FLINK-9752
<https://issues.apache.org/jira/browse/FLINK-9752>  , it is not yet
supported. Is there any ready-to-use solution that supports copying/moving
files from HDFS to S3 (something like a trigger from Flink after it has
finished writing to HDFS)?

Thanks and best regards,
Averell 



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Averell,

You are right that for Bulk Formats like Parquet, we roll on every checkpoint.

This is currently a limitation that has to do with the fact that bulk formats gather
and rely on metadata that they keep internally and which we cannot checkpoint
in Flink, as they do not expose it.

Setting the checkpoint interval affects how big your part files are going to be and,
in some cases, how efficient your compression is going to be. In some cases, the
more data there is to compress, the better the compression ratio.
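
As a rough sketch (assuming the Scala DataStream API), the part file size is
therefore steered through the checkpoint interval, e.g.:

	import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

	val env = StreamExecutionEnvironment.getExecutionEnvironment
	// With bulk formats rolling on every checkpoint, a 10-minute interval
	// roughly means one Parquet part per bucket every 10 minutes.
	env.enableCheckpointing(10 * 60 * 1000L)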

Regarding exposing withBucketCheckInterval(), you are right that it does not
serve much purpose for the moment.

Cheers,
Kostas

> On Oct 5, 2018, at 1:54 AM, Averell <lv...@gmail.com> wrote:
> 
> Hi Fabian, Kostas,
> 
> From the description of this ticket
> https://issues.apache.org/jira/browse/FLINK-9753, I understand that now my
> output Parquet file with StreamingFileSink will span multiple checkpoints.
> However, when I tried it (as in the code snippet below) I still see that
> one "part-X-X" file is created after each checkpoint. Is there any other
> configuration that I'm missing?
> 
> BTW, I have another question regarding
> StreamingFileSink.BulkFormatBuilder.withBucketCheckInterval(). As per the
> notes at the end of this page  StreamingFileSink
> <https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/streamfile_sink.html> 
> , bulk encoding can only be combined with OnCheckpointRollingPolicy, which
> rolls on every checkpoint. So setting that check interval makes no
> difference. Why, then, should we expose that withBucketCheckInterval method?
> 
> Thanks and best regards,
> Averell
> 
> 	def buildSink[T <: MyBaseRecord](outputPath: String)(implicit ct:
> ClassTag[T]): StreamingFileSink[T] = {
> 		StreamingFileSink.forBulkFormat(new Path(outputPath),
> ParquetAvroWriters.forReflectRecord(ct.runtimeClass)).asInstanceOf[StreamingFileSink.BulkFormatBuilder[T,
> String]]
> 				.withBucketCheckInterval(5L * 60L * 1000L)
> 				.withBucketAssigner(new DateTimeBucketAssigner[T]("yyyy-MM-dd--HH"))
> 				.build()
> 	}
> 
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Streaming to Parquet Files in HDFS

Posted by Averell <lv...@gmail.com>.
Hi Fabian, Kostas,

From the description of this ticket
https://issues.apache.org/jira/browse/FLINK-9753, I understand that now my
output Parquet file with StreamingFileSink will span multiple checkpoints.
However, when I tried it (as in the code snippet below) I still see that
one "part-X-X" file is created after each checkpoint. Is there any other
configuration that I'm missing?

BTW, I have another question regarding
StreamingFileSink.BulkFormatBuilder.withBucketCheckInterval(). As per the
notes at the end of this page  StreamingFileSink
<https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/streamfile_sink.html> 
, bulk encoding can only be combined with OnCheckpointRollingPolicy, which
rolls on every checkpoint. So setting that check interval makes no
difference. Why, then, should we expose that withBucketCheckInterval method?

Thanks and best regards,
Averell

	def buildSink[T <: MyBaseRecord](outputPath: String)(implicit ct:
ClassTag[T]): StreamingFileSink[T] = {
		StreamingFileSink.forBulkFormat(new Path(outputPath),
ParquetAvroWriters.forReflectRecord(ct.runtimeClass)).asInstanceOf[StreamingFileSink.BulkFormatBuilder[T,
String]]
				.withBucketCheckInterval(5L * 60L * 1000L)
				.withBucketAssigner(new DateTimeBucketAssigner[T]("yyyy-MM-dd--HH"))
				.build()
	}




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

Posted by Biswajit Das <bi...@gmail.com>.
Nice to see this finally!

On Mon, Oct 1, 2018 at 1:53 AM Fabian Hueske <fh...@gmail.com> wrote:

> Hi Bill,
>
> Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the
> previously mentioned StreamingFileSink [1], [2].
>
> Best, Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-9753
> [2] https://issues.apache.org/jira/browse/FLINK-9750
>
> On Fri, Sep 28, 2018 at 11:36 PM hao gao <ha...@gmail.com> wrote:
>
>> Hi Bill,
>>
>> I wrote those two Medium posts you mentioned above, but clearly the
>> techlab one is much better.
>> I would suggest just "close the file when checkpointing", which is the
>> easiest way. If you use BucketingSink, you can modify the code to make it
>> work: just replace the code from line 691 to 693 with
>> closeCurrentPartFile()
>>
>> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
>> This should guarantee exactly-once. You may have some files with an
>> underscore prefix when the Flink job fails, but those files are usually
>> ignored by query engines/readers, for example Presto.
>>
>> If you use 1.6 and later, I think the issue is already addressed
>> https://issues.apache.org/jira/browse/FLINK-9750
>>
>> Thanks
>> Hao
>>
>> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <ws...@apache.org>
>> wrote:
>>
>>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>>> files on HDFS via Flink. I'm able to read, parse, and construct objects for
>>> my messages in Flink; however, writing to Parquet is tripping me up. I do
>>> *not* need to have this be real-time; a delay of a few minutes, even up to
>>> an hour, is fine.
>>>
>>> I've found the following articles talking about this being very
>>> difficult:
>>> *
>>> https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>>> *
>>> https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>>> *
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>>
>>> All of these posts speak of troubles with the checkpointing mechanisms
>>> and Parquet's need to perform batch writes. I'm not experienced enough with
>>> Flink's checkpointing or Parquet's file format to completely understand
>>> the issue. So my questions are as follows:
>>>
>>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>>> possible in a way that _might_ cause duplicates during an error?
>>>
>>> 2) Is there another/better format to use other than Parquet that offers
>>> compression and the ability to be queried by something like Drill or Impala?
>>>
>>> 3) Any further recommendations for solving the overall problem:
>>> ingesting syslogs and writing them to a file(s) that is searchable by an
>>> SQL(-like) framework?
>>>
>>> Thanks!
>>>
>>> Bill-
>>>
>>
>>
>> --
>> Thanks
>>  - Hao
>>
>