You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/21 06:50:33 UTC

[GitHub] [iceberg] openinx opened a new issue #1639: Flink: write the CDC records into apache iceberg tables.

openinx opened a new issue #1639:
URL: https://github.com/apache/iceberg/issues/1639


   Currently,  we apache iceberg have implemented the equality-deletes and position-delete internally,  but we still lack the ability to work with different engines.  In this PR https://github.com/apache/iceberg/pull/1444 ,  we spark could read the records by merging delete files and. data files,  and PR https://github.com/apache/iceberg/pull/1517 give flink the similar ability to read delete records.   So the core thing what we need is to make the engine writers enable,  that means : 
   
   1.   spark, flink, hive etc could write deletes CDC events into apache iceberg table; 
   2.  we could execute SQL sentences to delete records which is matching the query conditions in a batch job. 
   
   For spark engine,  I think @rdblue and @chenjunjiedada  have had a good plan to get those finished.  I would try to PR a PoC solution in this issue, mainly focused on the first cases,  batch delete should be similar to the spark implementation. 
   
   We flink have provided a lib to consume various database's change log events,  and made them into a RowData DataStream.  The RowData has a `RowKind` to indicate whether this event is an INSERT ? DELETE ? UPDATE_BEFORE ?  UPDATE_AFTER ?   This is very friendly to accomplish the cdc writers for flink.
   
   ```java
   /**
    * Lists all kinds of changes that a row can describe in a changelog.
    */
   @PublicEvolving
   public enum RowKind {
   
   	// Note: Enums have no stable hash code across different JVMs, use toByteValue() for
   	// this purpose.
   
   	/**
   	 * Insertion operation.
   	 */
   	INSERT("+I", (byte) 0),
   
   	/**
   	 * Update operation with the previous content of the updated row.
   	 *
   	 * <p>This kind SHOULD occur together with {@link #UPDATE_AFTER} for modelling an update that needs
   	 * to retract the previous row first. It is useful in cases of a non-idempotent update, i.e., an
   	 * update of a row that is not uniquely identifiable by a key.
   	 */
   	UPDATE_BEFORE("-U", (byte) 1),
   
   	/**
   	 * Update operation with new content of the updated row.
   	 *
   	 * <p>This kind CAN occur together with {@link #UPDATE_BEFORE} for modelling an update that
   	 * needs to retract the previous row first. OR it describes an idempotent update, i.e., an update
   	 * of a row that is uniquely identifiable by a key.
   	 */
   	UPDATE_AFTER("+U", (byte) 2),
   
   	/**
   	 * Deletion operation.
   	 */
   	DELETE("-D", (byte) 3);
          // ... 
   }
   ```
   
   It may need some work for how to integrate the current flink CDC API with apache iceberg implementation. Let's track those thing in this issue.
   
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #1639: Flink: write the CDC records into apache iceberg tables.

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #1639:
URL: https://github.com/apache/iceberg/issues/1639#issuecomment-713377359


   @paulzanmei ,  I think it will be available in the next 0.12.0 release, but I am not sure what's the precise release time, maybe two months later. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #1639: Flink: write the CDC records into apache iceberg tables.

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #1639:
URL: https://github.com/apache/iceberg/issues/1639


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wg1026688210 commented on issue #1639: Flink: write the CDC records into apache iceberg tables.

Posted by GitBox <gi...@apache.org>.
wg1026688210 commented on issue #1639:
URL: https://github.com/apache/iceberg/issues/1639#issuecomment-716327956


   LGTM, We also have this requirement


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] paulzanmei commented on issue #1639: Flink: write the CDC records into apache iceberg tables.

Posted by GitBox <gi...@apache.org>.
paulzanmei commented on issue #1639:
URL: https://github.com/apache/iceberg/issues/1639#issuecomment-713382813


   @openinx   Thank you for your reply, I will continue to follow


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] paulzanmei commented on issue #1639: Flink: write the CDC records into apache iceberg tables.

Posted by GitBox <gi...@apache.org>.
paulzanmei commented on issue #1639:
URL: https://github.com/apache/iceberg/issues/1639#issuecomment-713371806


   When do you plan to complete the Flink CDC integration? We also have this requirement


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org