Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/15 13:17:38 UTC

[GitHub] [iceberg] bvinayakumar opened a new issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

bvinayakumar opened a new issue #3124:
URL: https://github.com/apache/iceberg/issues/3124


   I am trying to write data to an Iceberg table using the Flink engine. A code snippet is provided below for reference. Please note that I am currently testing this code from an IDE. The Iceberg table was created using the Flink SQL client. I am using Flink 1.13.1 and the Iceberg 0.12.0 runtime.
   
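   ```
   import scala.collection.JavaConverters._
   import scala.collection.mutable
   
   import org.apache.hadoop.conf.Configuration
   import org.apache.iceberg.CatalogProperties
   import org.apache.iceberg.catalog.TableIdentifier
   import org.apache.iceberg.flink.{CatalogLoader, TableLoader}
   import org.apache.iceberg.flink.sink.FlinkSink
   
   // Catalog properties: Glue catalog, S3 file IO, DynamoDB-based lock table.
   // Constants.CatalogType is not shown in this snippet.
   val catalogProperties = mutable.Map.empty[String, String]
   catalogProperties.put(Constants.CatalogType, "iceberg")
   catalogProperties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://my-iceberg-bucket")
   catalogProperties.put(CatalogProperties.CATALOG_IMPL, "org.apache.iceberg.aws.glue.GlueCatalog")
   catalogProperties.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO")
   catalogProperties.put(CatalogProperties.LOCK_IMPL, "org.apache.iceberg.aws.glue.DynamoLockManager")
   catalogProperties.put(CatalogProperties.LOCK_TABLE, "myLockTable")
   
   val hadoopConf = new Configuration
   val catalogLoader = CatalogLoader.custom("my_catalog", catalogProperties.asJava, hadoopConf, "org.apache.iceberg.aws.glue.GlueCatalog")
   val tableLoader = TableLoader.fromCatalog(catalogLoader, TableIdentifier.of("my_ns", "data"))
   
   // env, kafkaConsumer and DataMapper are defined elsewhere (not shown).
   // DataMapper converts each JSON message from the Kafka source to GenericRowData.
   val dataStream = env.addSource(kafkaConsumer).setParallelism(4).flatMap(new DataMapper)
   
   // Attach the Iceberg sink to the underlying Java DataStream.
   FlinkSink.forRowData(dataStream.javaStream)
     .tableLoader(tableLoader)
     .build()
   ```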
   
   The data is written successfully (i.e. I can see Parquet files being created under the `data` folder within the specified S3 bucket). However, `current-snapshot-id` in the metadata JSON file is not updated (it stays at -1). Only one JSON file is generated in the `metadata` folder, while there are several Parquet files under the `data` folder.
   
   ```
   {
     "format-version" : 1,
     "table-uuid" : "3a1efd9d-a3ce-47b4-9f5f-d2a61ff6efd9",
     "location" : "s3://...",
     "last-updated-ms" : 1631709058366,
     "last-column-id" : 10,
     "schema" : {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ ... ]
     },
     "current-schema-id" : 0,
     "schemas" : [ {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ ... ]
     } ],
     "partition-spec" : [ ... ],
     "default-spec-id" : 0,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ ... ]
     } ],
     "last-partition-id" : 1003,
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ ]
     } ],
     "properties" : { },
     "current-snapshot-id" : -1,
     "snapshots" : [ ],
     "snapshot-log" : [ ],
     "metadata-log" : [ ]
   }
   ```
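   
   To make this condition easy to check, here is a minimal sketch (not part of Iceberg) that scans the metadata JSON text for `current-snapshot-id` and reports whether any snapshot has been committed. The class name `SnapshotCheck` and its methods are hypothetical, and the sketch assumes that -1 (or a missing field) means no snapshot exists, as in the file above. It uses a regular expression instead of a JSON parser only to stay dependency-free.
   
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;
   
   public class SnapshotCheck {
       // Matches e.g. "current-snapshot-id" : -1 and captures the numeric value.
       private static final Pattern CURRENT_SNAPSHOT =
           Pattern.compile("\"current-snapshot-id\"\\s*:\\s*(-?\\d+)");
   
       /** Returns the snapshot ID found in the metadata JSON text, or -1 if the field is absent. */
       public static long currentSnapshotId(String metadataJson) {
           Matcher m = CURRENT_SNAPSHOT.matcher(metadataJson);
           return m.find() ? Long.parseLong(m.group(1)) : -1L;
       }
   
       /** Assumption: -1 is the "no snapshot committed yet" marker, as in the file above. */
       public static boolean hasCommittedSnapshot(String metadataJson) {
           return currentSnapshotId(metadataJson) != -1L;
       }
   
       public static void main(String[] args) {
           String uncommitted = "{ \"current-snapshot-id\" : -1, \"snapshots\" : [ ] }";
           System.out.println(hasCommittedSnapshot(uncommitted)); // prints "false"
       }
   }
   ```
   
   Running the check against each file in the `metadata` folder would show whether any commit has reached the table, independently of how many data files exist.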
   
   I have read an excellent blog [1] about Apache Iceberg architecture and was expecting the current snapshot ID to be updated on successful writes.
   
   Any comments on why the current snapshot ID may not be updated in the metadata JSON file?
   
   [1] Apache Iceberg: An Architectural Look Under the Covers
   https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] bvinayakumar edited a comment on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
bvinayakumar edited a comment on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921069257


   Yes, Parquet files are generated for me under the data folder almost instantaneously.


[GitHub] [iceberg] tuziling removed a comment on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
tuziling removed a comment on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-920772320


   Your data can be written to iceberg in real time


[GitHub] [iceberg] bvinayakumar commented on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
bvinayakumar commented on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921688448


   I did enable checkpointing. Did you test your code from an IDE or by deploying a jar to Flink?
   
   `env.enableCheckpointing(60000, CheckpointingMode.AT_LEAST_ONCE)`


[GitHub] [iceberg] bvinayakumar edited a comment on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
bvinayakumar edited a comment on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921676245


   Is the metadata snapshot file (current-snapshot-id, snapshots) updated for you?


[GitHub] [iceberg] tuziling commented on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
tuziling commented on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921651736


   My data is also written in real time, but I don’t know whether the added data can be updated based on the primary key, using RowKind.UPDATE_BEFORE or RowKind.UPDATE_AFTER?


[GitHub] [iceberg] tuziling commented on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
tuziling commented on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921683240


   Updated. Did you not add a checkpoint mechanism to the Flink code, so the data cannot be written?


[GitHub] [iceberg] tuziling commented on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
tuziling commented on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-920772632


   Can your data be written to Iceberg in real time?


[GitHub] [iceberg] tuziling edited a comment on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
tuziling edited a comment on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921694987


   The code of my checkpoint and sink part looks like this:
   
   env.enableCheckpointing(1000 * 60);
   
   FlinkSink.forRowData(rowData)
       .tableLoader(tableLoader)
       // .overwrite(true)
       .build();
   
   Which country are you from?


[GitHub] [iceberg] bvinayakumar edited a comment on issue #3124: When writing data to S3 using Glue Catalog, current snapshot ID is -1 and not updated in the metadata file generated

Posted by GitBox <gi...@apache.org>.
bvinayakumar edited a comment on issue #3124:
URL: https://github.com/apache/iceberg/issues/3124#issuecomment-921691552


   @tuziling Is read working for you also?
   
   ```
   val env = org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getExecutionEnvironment()
   FlinkSource.forRowData()
         .env(env)
         .tableLoader(tableLoader)
         .streaming(false)
         .build()
         .print()
   env.execute(jobName)
   ```

