Posted to user@flink.apache.org by Dan Hill <qu...@gmail.com> on 2020/08/10 22:13:06 UTC

Batch version of StreamingFileSink.forRowFormat(...)

Hi. I have a streaming job that writes to
StreamingFileSink.forRowFormat(...) with an encoder that converts protocol
buffers to byte arrays.
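
Roughly, the sink is set up like the sketch below (LogRecord is a
stand-in for my actual protobuf class, the path is a placeholder, and
the newline framing is just one possible choice):

import org.apache.flink.api.common.serialization.Encoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// Writes each protobuf record as its serialized bytes followed by a newline.
// This framing is only safe if the delimiter cannot occur in the serialized
// bytes; base64-encoding each record (or a length-prefixed framing) avoids
// that problem.
StreamingFileSink<LogRecord> sink = StreamingFileSink
    .forRowFormat(
        new Path("s3://my-bucket/raw-logs"),
        (Encoder<LogRecord>) (record, stream) -> {
            stream.write(record.toByteArray());  // protobuf -> bytes
            stream.write('\n');                  // record delimiter
        })
    .build();

events.addSink(sink);  // events is a DataStream<LogRecord>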

How do I read this data back in during a batch pipeline (using the
DataSet API)?  Do I use env.readFile with a custom DelimitedInputFormat?
The StreamingFileSink documentation
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html>
is a bit vague.

These files are used as raw logs.  They're processed offline and the whole
record is read and used at the same time.

Thanks!
- Dan

Re: Batch version of StreamingFileSink.forRowFormat(...)

Posted by Dan Hill <qu...@gmail.com>.
Thanks!  I'll take a look.

Re: Batch version of StreamingFileSink.forRowFormat(...)

Posted by Timo Walther <tw...@apache.org>.
Hi Dan,

InputFormats are the connectors of the DataSet API. Yes, you can use
readFile, readCsvFile, readFileOfPrimitives, etc. However, I would
recommend also giving the Table API a try. The unified TableEnvironment
can perform batch processing and is integrated with a number of
connectors, such as the filesystem connector [1] and the Hive
abstractions [2].
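
For the files you describe, a minimal sketch of the DataSet side could
look like this (LogRecord stands in for your protobuf class, the path is
a placeholder, and it assumes the sink wrote one record per
newline-delimited line):

import java.io.IOException;
import java.util.Arrays;

import org.apache.flink.api.common.io.DelimitedInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Parses each newline-delimited chunk of bytes back into a protobuf message.
public class LogRecordInputFormat extends DelimitedInputFormat<LogRecord> {

    @Override
    public LogRecord readRecord(LogRecord reuse, byte[] bytes, int offset, int numBytes)
            throws IOException {
        // Copy this record's slice out of the shared buffer and parse it.
        return LogRecord.parseFrom(Arrays.copyOfRange(bytes, offset, offset + numBytes));
    }
}

// In the batch job:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<LogRecord> records =
    env.readFile(new LogRecordInputFormat(), "s3://my-bucket/raw-logs");

If the delimiter can occur inside the serialized bytes, a length-prefixed
framing (writeDelimitedTo/parseDelimitedFrom with a custom FileInputFormat)
is the safer variant. The filesystem connector [1] likewise needs a format
that can decode your records.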

I hope this helps.

Regards,
Timo

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
[2] 
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_read_write.html
