Posted to issues@flink.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/12/04 13:27:00 UTC

[jira] [Commented] (FLINK-19595) Flink SQL support S3 select

    [ https://issues.apache.org/jira/browse/FLINK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244004#comment-17244004 ] 

Steve Loughran commented on FLINK-19595:
----------------------------------------

The S3A connector supports S3 Select (HADOOP-15229): https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/s3_select.html

Does the Tencent API work with this? Is any tuning needed?

FWIW, it's very hard to use in code that tries to split files, or that assumes the length of the stream == the length of the file. And because the results come back as text, either CSV or JSON, processing them is inefficient. If the results came back as Avro records, life would be easier.

Also: is Tencent's S3 consistent? That is, if S3Guard "went away", would everything still work?
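
A minimal sketch of the stream-length point above, in plain Java with no S3 dependency. The two byte arrays are stand-ins for a stored CSV object and for a smaller select result, and `readToEof` is a hypothetical helper, not part of S3A or Flink. It only shows why a reader should drain the stream to EOF rather than trust the length from the file status.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SelectStreamLength {

    // Hypothetical helper: drain a stream until EOF instead of trusting a
    // length taken from the object's metadata.
    static byte[] readToEof(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stored object, i.e. what the file status would report.
        byte[] stored = "alice,30\nbob,25\n".getBytes(StandardCharsets.UTF_8);
        // Stand-in for a select result that projects only the first column.
        byte[] selected = "alice\nbob\n".getBytes(StandardCharsets.UTF_8);

        byte[] result = readToEof(new ByteArrayInputStream(selected));
        System.out.println("file length=" + stored.length
                + ", stream length=" + result.length);
    }
}
```

Code that allocates a buffer of the file length and calls a read-fully method on the select stream would hit EOF early, which is the failure mode described above.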

> Flink SQL support S3 select
> ---------------------------
>
>                 Key: FLINK-19595
>                 URL: https://issues.apache.org/jira/browse/FLINK-19595
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems, Table SQL / Ecosystem
>            Reporter: liuxiaolong
>            Priority: Major
>         Attachments: image-2020-11-02-18-08-11-461.png, image-2020-11-02-18-18-14-961.png
>
>
> h4. Summary
> Flink reads data stored in Tencent COS through S3AInputStream.java, which calls the getObject function of AmazonS3Client.java.
> Tencent COS now supports pushdown for the CSV and Parquet file formats.
> In these cases, reading the data with getObject wastes a lot of bandwidth.
> So I think Flink SQL should support S3 Select to reduce the wasted bandwidth.
>  
> h4. Design
> 1. In HiveMapredSplitReader.java, we use int[] selectedFields to construct the S3 SELECT SQL. We also created a new class named S3SelectCsvReader, which uses the AmazonS3Client.selectObjectContent function to read the CSV file line by line.
> !image-2020-11-02-18-08-11-461.png|width=535,height=967!
>  
> !image-2020-11-02-18-18-14-961.png|width=629,height=284!
>  
> 2.  Flink Demo Table:
> 1) Table schema
> Flink SQL> desc cos.test_s3a;
>  root
>  |-- name: STRING (col1)
>  |-- age: INT (col2)
>  |-- dt: STRING (col3, it's a partition column)
>  
> 2) Conversion (Flink SQL to S3 SELECT SQL)
> Flink SQL                                                 S3 SELECT SQL
> select name from cos.test_s3a;                        =>  SELECT s._1, null FROM S3Object s
> select age from cos.test_s3a;                         =>  SELECT null, s._2 FROM S3Object s
> select dt, name, age from cos.test_s3a;               =>  SELECT s._1, s._2 FROM S3Object s
> select dt from cos.test_s3a;                          =>  SELECT null, null FROM S3Object s
> select * from cos.test_s3a;                           =>  SELECT s._1, s._2 FROM S3Object s
> select name from cos.test_s3a where dt='2020-07-15';  =>  SELECT s._1, null FROM S3Object s
>  
> 3) Patch Commit
> https://github.com/Coderlxl/flink/commit/b211f4830a7301bf9283a6d37209000b176913ad
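
The conversion table in the quoted description can be sketched roughly as follows. This is an illustration reconstructed from the table only, not the code in the linked patch: the class name, `buildSelectSql`, and the `dataColumnCount` parameter are hypothetical, and it assumes that partition columns come last in the schema and are not stored in the CSV object.

```java
import java.util.StringJoiner;

public class S3SelectProjection {

    // Build an S3 Select projection over a headerless CSV object.
    // dataColumnCount: number of columns stored in the file (partition
    //                  columns excluded).
    // selectedFields:  zero-based indexes into the table schema; indexes at or
    //                  beyond dataColumnCount are treated as partition columns
    //                  and skipped, since they are not stored in the object.
    static String buildSelectSql(int dataColumnCount, int[] selectedFields) {
        boolean[] wanted = new boolean[dataColumnCount];
        for (int f : selectedFields) {
            if (f >= 0 && f < dataColumnCount) {
                wanted[f] = true;
            }
        }
        // Keep one output position per stored column so the reader's column
        // layout is unchanged; unselected columns are returned as null.
        StringJoiner cols = new StringJoiner(", ");
        for (int i = 0; i < dataColumnCount; i++) {
            cols.add(wanted[i] ? "s._" + (i + 1) : "null");
        }
        return "SELECT " + cols + " FROM S3Object s";
    }

    public static void main(String[] args) {
        // Schema from the description: name (0), age (1), dt (2, partition).
        System.out.println(buildSelectSql(2, new int[] {0}));       // select name
        System.out.println(buildSelectSql(2, new int[] {2, 0, 1})); // select dt, name, age
        System.out.println(buildSelectSql(2, new int[] {2}));       // select dt
    }
}
```

As in the table, the where clause on the partition column does not reach the generated SQL; in this sketch it is assumed to be handled by partition pruning before the object is read.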



--
This message was sent by Atlassian Jira
(v8.3.4#803005)