Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/16 09:38:49 UTC

[GitHub] [iceberg] hameizi opened a new pull request, #3095: Flink: flink read iceberg upsert data use streaming mode

hameizi opened a new pull request, #3095:
URL: https://github.com/apache/iceberg/pull/3095

   Currently, when Flink reads an Iceberg table in streaming mode, it ignores 'overwrite' snapshots, so users can't read deleted data in real time.
   This PR first emits the equality deletes (eqdelete) as delete RowData (-U), and then emits the RowData (+I) produced by joining the added data with the position deletes (posdelete) via the function applyPosdelete. With this, Flink can maintain a primary-key table by reading from an Iceberg primary-key table.
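
   The emission order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `ChangelogSketch` and `emit` are hypothetical names, strings stand in for Flink `RowData`, and a set of row positions stands in for a position delete file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from Iceberg or Flink.
class ChangelogSketch {

  // Emits the change rows for one snapshot in the order given in the
  // description: equality deletes first (as -U), then the added rows that
  // survive the position deletes (as +I).
  static List<String> emit(List<String> eqDeletedKeys,
                           List<String> addedRows,
                           Set<Integer> posDeletedPositions) {
    List<String> out = new ArrayList<>();

    // Step 1: every equality delete becomes a delete row.
    for (String key : eqDeletedKeys) {
      out.add("-U " + key);
    }

    // Step 2: added rows are filtered by position deletes, then emitted as inserts.
    for (int pos = 0; pos < addedRows.size(); pos++) {
      if (!posDeletedPositions.contains(pos)) {
        out.add("+I " + addedRows.get(pos));
      }
    }
    return out;
  }
}
```

   Emitting the deletes before the inserts matters for a downstream keyed table: an upsert to an existing key is then seen as a removal of the old row followed by the new row, rather than the reverse.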


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a diff in pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on code in PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#discussion_r852575182


##########
flink/v1.13/flink/src/main/java/org/apache/iceberg/flink/source/StreamingRowDataFileScanTaskReader.java:
##########
@@ -0,0 +1,311 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.flink.source;
+
+import java.util.List;
+import java.util.Map;
+import org.apache.flink.annotation.Internal;
+import org.apache.flink.table.data.RowData;
+import org.apache.iceberg.FileContent;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.encryption.InputFilesDecryptor;
+import org.apache.iceberg.flink.data.FlinkAvroReader;
+import org.apache.iceberg.flink.data.FlinkOrcReader;
+import org.apache.iceberg.flink.data.FlinkParquetReaders;
+import org.apache.iceberg.flink.data.RowDataProjection;
+import org.apache.iceberg.flink.data.RowDataUtil;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.CloseableIterator;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mapping.NameMappingParser;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.PartitionUtil;
+
+@Internal
+public class StreamingRowDataFileScanTaskReader implements FileScanTaskReader<RowData> {

Review Comment:
   Similarly, maybe `ChangelogRowDataFileScanTaskReader` is more accurate





[GitHub] [iceberg] 20100507 commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
20100507 commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1156358997

   @hameizi A month has passed, how is your progress?




[GitHub] [iceberg] GavinH1984 commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
GavinH1984 commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1127355681

   @hameizi Hi, is there any plan for this PR? I saw that PR [#4580](https://github.com/apache/iceberg/pull/4580) has already been merged into master.




[GitHub] [iceberg] stevenzwu commented on a diff in pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on code in PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#discussion_r852574695


##########
core/src/main/java/org/apache/iceberg/ManifestGroup.java:
##########
@@ -189,6 +189,48 @@ public CloseableIterable<FileScanTask> planFiles() {
     }
   }
 
+  /**
+   * Returns an iterable of scan tasks. It is safe to add entries of this iterable
+   * to a collection as {@link DataFile} in each {@link FileScanTask} is defensively
+   * copied.
+   * @return a {@link CloseableIterable} of {@link FileScanTask}
+   */
+  public CloseableIterable<FileScanTask> planStreamingFiles() {

Review Comment:
   "Streaming" is an ambiguous name here. Maybe `planFilesForChangelog`?





[GitHub] [iceberg] iflytek-hmwang5 commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
iflytek-hmwang5 commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1201962463

   > > @hameizi Hi, is there any plan for this PR? I saw that PR [#4580](https://github.com/apache/iceberg/pull/4580) has already been merged into master.
   > 
   > I will work on this these days.
   [hameizi](https://github.com/hameizi) how about this PR?
   
   




[GitHub] [iceberg] Xiangakun commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by "Xiangakun (via GitHub)" <gi...@apache.org>.
Xiangakun commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1584038291

   @hameizi @openinx Hello, what's the status of this issue now, please? Is anything blocking it? Thanks.




[GitHub] [iceberg] Bmyymwtao commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
Bmyymwtao commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1217677205

   > > @hameizi Hi, is there any plan for this PR? I saw that PR [#4580](https://github.com/apache/iceberg/pull/4580) has already been merged into master.
   > 
   > I will work on this these days.
   
   Hi, I pulled this PR and ran some tests. I found that when a snapshot contains only delete files, no data is read from that snapshot. Looking at ManifestGroup, I found that a FileScanTask can only be generated when a data manifest exists. Have you considered the case where only a delete manifest exists and the -D data still has to be read?
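
   To make the reported behavior concrete, here is a simplified model of it. This is only a sketch of the observation above, not the actual ManifestGroup code: `DeleteOnlySnapshotSketch`, `Task`, and `planTasks` are hypothetical names, and file names are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of task planning: one task per data file,
// with the snapshot's delete files attached to it.
class DeleteOnlySnapshotSketch {

  static final class Task {
    final String dataFile;
    final List<String> deleteFiles;

    Task(String dataFile, List<String> deleteFiles) {
      this.dataFile = dataFile;
      this.deleteFiles = deleteFiles;
    }
  }

  // Tasks are created only while iterating data files, so a snapshot that
  // carries delete files but no data files produces no tasks at all.
  static List<Task> planTasks(List<String> dataFiles, List<String> deleteFiles) {
    List<Task> tasks = new ArrayList<>();
    for (String dataFile : dataFiles) {
      tasks.add(new Task(dataFile, deleteFiles));
    }
    return tasks;
  }
}
```

   In this model, calling `planTasks` with an empty data-file list and a non-empty delete-file list returns an empty list, so a reader driven purely by these tasks would never see the deletes.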




[GitHub] [iceberg] stevenzwu commented on a diff in pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on code in PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#discussion_r852574835


##########
api/src/main/java/org/apache/iceberg/TableScan.java:
##########
@@ -159,6 +159,17 @@ default TableScan select(String... columns) {
    */
   TableScan appendsAfter(long fromSnapshotId);
 
+
+  /**
+   * Create a new {@link TableScan} to read appended data for {@code snapshotId} exclusive to the current snapshot
+   * inclusive.
+   *
+   * @param snapshotId - the snapshot id read by the user, exclusive
+   * @return a table scan which can read append data for {@code snapshotId}
+   * exclusive and up to current snapshot inclusive
+   */
+  TableScan appendsCurrent(long snapshotId);

Review Comment:
   This change seems unnecessary.





[GitHub] [iceberg] hameizi closed pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
hameizi closed pull request #3095: Flink: flink read iceberg upsert data use streaming mode
URL: https://github.com/apache/iceberg/pull/3095




[GitHub] [iceberg] hameizi commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
hameizi commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1127451961

   > @hameizi Hi, is there any plan for this PR? I saw that PR [#4580](https://github.com/apache/iceberg/pull/4580) has already been merged into master.
   
   I will work on this these days.




[GitHub] [iceberg] hameizi commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
hameizi commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1127450135

   > @hameizi Hi, is there any plan for this PR? I saw that PR [#4580](https://github.com/apache/iceberg/pull/4580) has already been merged into master.
   
   




[GitHub] [iceberg] hameizi commented on pull request #3095: Flink: flink read iceberg upsert data use streaming mode

Posted by GitBox <gi...@apache.org>.
hameizi commented on PR #3095:
URL: https://github.com/apache/iceberg/pull/3095#issuecomment-1102110824

   @stevenzwu I will polish this PR after https://github.com/apache/iceberg/pull/4580 is finished, because the two PRs are related.

