Posted to user@flink.apache.org by Chirag Dewan via user <us...@flink.apache.org> on 2023/03/07 11:35:47 UTC

CSV File Sink in Streaming Use Case

Hi,
I am working on a Java DataStream application and need to implement a File sink with CSV format.
I see that I have two options here - Row and Bulk (https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/filesystem/#format-types-1)
So for writing out CSV files, which one should I use: Row or Bulk?
I find the documentation for the File connectors confusing: I can see a PyFlink example that uses a BulkWriter for CSV, but the same class is not public in flink-csv. So does Flink not support CsvBulkWriter for Java?
Also, the Table API File sink explicitly lists CSV as a Row format, but the DataStream File sink documentation doesn't mention CSV at all.
This is all quite confusing. Any leads are much appreciated.
Thanks

Re: CSV File Sink in Streaming Use Case

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Hi all,

One thing to note is that the CSVBulkReader does not support splitting. Previously,
with TextInputFormat, we could use the block size to split files, but in the
streaming world that is not available.

Regards
Ram


Re: CSV File Sink in Streaming Use Case

Posted by yuxia <lu...@alumni.sjtu.edu.cn>.
Hi, as the doc says:
'The BulkFormat reads and decodes batches of records at a time.' So the bulk format is not bound to columnar formats; a bulk writer for CSV is indeed implemented in the Flink code. Actually, you can use either Row or Bulk depending on how you would like to write the data.

As for why the CSV BulkFormat is missing from the docs and is not public in flink-csv, I really don't know. I guess the reason may be that Flink doesn't want to expose it in the DataStream API, only in the Table API.
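
For the row style, a rough (untested) sketch could look like the following - the path, the Tuple2 record type and the naive comma-joining are just placeholders, and there is no quoting or escaping of field values:

import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.serialization.Encoder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RowCsvSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // In streaming mode the FileSink commits finished part files on checkpoints.
        env.enableCheckpointing(10_000);

        // Row format: each record is encoded individually as one CSV line.
        // Naive joining only - commas inside fields are not quoted or escaped.
        Encoder<Tuple2<Integer, String>> csvEncoder = (record, out) ->
                out.write((record.f0 + "," + record.f1 + "\n").getBytes(StandardCharsets.UTF_8));

        FileSink<Tuple2<Integer, String>> sink = FileSink
                .forRowFormat(new Path("/tmp/csv-row-out"), csvEncoder)
                .build();

        env.fromElements(Tuple2.of(1, "alice"), Tuple2.of(2, "bob"))
                .sinkTo(sink);

        env.execute("Row-format CSV sink");
    }
}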

Best regards, 
Yuxia 


发件人: "User" <us...@flink.apache.org> 
收件人: "User" <us...@flink.apache.org> 
发送时间: 星期二, 2023年 3 月 07日 下午 7:35:47 
主题: CSV File Sink in Streaming Use Case 

Hi, 

I am working on a Java DataStream application and need to implement a File sink with CSV format. 

I see that I have two options here - Row and Bulk ( [ https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/filesystem/#format-types-1 | https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/filesystem/#format-types-1 ] ) 

So for CSV file distribution which one should I use? Row or Bulk? 

I think the documentation is confusing for File connectors. Because I can see an example for PyFlink which uses a BulkWriter for CSV. But the same class is not public in flink-csv. So does Flink not support CSVBulkWriter for Java? 

And for Table API File sink explicitly supports CSV for Row format. But fails to mention anything about CSV in DataStream File sink. 

This all is just really confusing. Any leads on this are much appreciated. 

Thanks 


Re: CSV File Sink in Streaming Use Case

Posted by Shammon FY <zj...@gmail.com>.
Hi Chirag

CsvBulkWriter implements BulkWriter and adds no special methods, so I think you can
implement BulkWriter directly in your application instead of using CsvBulkWriter.
You can give it a try, thanks.
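
A minimal sketch of such a writer could look like this (untested; the String[] record type and the naive comma-joining are placeholders, without any quoting or escaping):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;

/** Hypothetical bulk writer: each element becomes one CSV line. */
public class SimpleCsvBulkWriter implements BulkWriter<String[]> {

    private final FSDataOutputStream out;

    public SimpleCsvBulkWriter(FSDataOutputStream out) {
        this.out = out;
    }

    @Override
    public void addElement(String[] element) throws IOException {
        out.write((String.join(",", element) + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void finish() throws IOException {
        // Do not close the stream here - the FileSink manages the stream's lifecycle.
        flush();
    }
}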

Best,
Shammon


Re: CSV File Sink in Streaming Use Case

Posted by Chirag Dewan via user <us...@flink.apache.org>.
Thanks for the reply Shammon. I looked at DataStreamCsvITCase - it gives a very good example, and I can implement something similar. However, the CsvBulkWriter it uses to create a factory has default (package-private) access, so it can be accessed from that test case but not from my application.
Should I just replicate it as a public class in my application? If this class is intended to be used as a CSV bulk writer, should it be public in flink-csv?
Thanks

Re: CSV File Sink in Streaming Use Case

Posted by Shammon FY <zj...@gmail.com>.
Hi

You can create a `BulkWriter.Factory` that creates a `CsvBulkWriter`, and then create a
`FileSink` with `FileSink.forBulkFormat`. You can see the details in
`DataStreamCsvITCase.testCustomBulkWriter`.
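
Roughly, the wiring could look like this (untested sketch, reusing the SimpleCsvBulkWriter sketched earlier in the thread; the path and sample data are placeholders):

import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BulkCsvSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats roll part files only on checkpoints, so checkpointing must be enabled.
        env.enableCheckpointing(10_000);

        // BulkWriter.Factory is a single-method interface that receives the part file's
        // output stream, so a constructor reference to the custom writer is enough.
        BulkWriter.Factory<String[]> factory = SimpleCsvBulkWriter::new;

        FileSink<String[]> sink = FileSink
                .forBulkFormat(new Path("/tmp/csv-bulk-out"), factory)
                .build();

        env.fromElements(new String[]{"1", "alice"}, new String[]{"2", "bob"})
                .sinkTo(sink);

        env.execute("Bulk-format CSV sink");
    }
}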

Best,
Shammon

