Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/13 11:43:55 UTC

[GitHub] [iceberg] VasilyMelnik opened a new issue, #5515: "Table not found" error while using rewriteDataFiles

VasilyMelnik opened a new issue, #5515:
URL: https://github.com/apache/iceberg/issues/5515

   ### Query engine
   
   Spark, Java API
   
   ### Question
   
   Hi all.
   I created an Iceberg table from spark-shell like this:
   
   ```
    spark.sql(""" create table wrk.my_table
         using iceberg
         as select ...""")
   ```
   
   I see my_table in the Hive metastore, and all SQL queries via spark-shell work fine.
   But when I try to use the Java API:
   ```
   	import org.apache.iceberg.hive.HiveCatalog
   	import org.apache.iceberg.catalog.TableIdentifier
   	import org.apache.iceberg.Table
   	import org.apache.iceberg.spark.actions.SparkActions
   	import org.apache.iceberg.expressions.Expressions
   	
   	val catalog = new HiveCatalog();
   	catalog.setConf(spark.sparkContext.hadoopConfiguration);
   	val properties =  new java.util.HashMap[String,String];
   	catalog.initialize("hive", properties);		
   	val name = TableIdentifier.of("wrk", "my_table");
   	val table = catalog.loadTable(name)	
   	SparkActions.get().rewriteDataFiles(table).filter(Expressions.alwaysTrue()).option("target-file-size-bytes", (500 * 1024 * 1024).toString).option("min-input-files","2").execute()
   ```
   I get an error:
   	
   >  ERROR actions.BaseRewriteDataFilesSparkAction: Cannot complete rewrite, partial-progress.enabled is not enabled and one of the file set groups failed to be rewritten. This error occurred during the writing of new files, not during the commit process. This indicates something is wrong that doesn't involve conflicts with other Iceberg operations. Enabling partial-progress.enabled may help in this case but the root cause should be investigated. Cleaning up 0 groups which finished being written.
   > java.lang.RuntimeException: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table hive.wrk.my_table not found
   
   The Table object is valid: I can get its location and currentSnapshot.
   What is the problem?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] xiongliang1989 commented on issue #5515: "Table not found" error while using rewriteDataFiles

Posted by "xiongliang1989 (via GitHub)" <gi...@apache.org>.

xiongliang1989 commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1489598133

   Query engine
   Spark 3.1.1, Iceberg 1.1.0
   
   Question
   Hi all.
   I created an Iceberg table using a Hadoop catalog from Java code like this:
   ```
   // Get the SparkSession
   SparkSession session = SparkSession
           .builder()
           .appName("spark-iceberg")
           .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
           .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.local.type", "hadoop")
           .config("spark.sql.catalog.local.warehouse", "hdfs://bigdata/bigdata/warehouse")
           .master("local[1]")
           .getOrCreate();

   Configuration hadoopConf = new Configuration();
   HadoopCatalog catalog = new HadoopCatalog(hadoopConf, "hdfs://bigdata/bigdata/warehouse");
   TableIdentifier name = TableIdentifier.of("db", "n1_log");

   // Create the table if it does not exist yet
   if (!catalog.tableExists(name)) {
       session.sql("CREATE TABLE local.db.n1_log ("
               + "ran_ip string, ran_port int, amf_ip string, amf_port int, "
               + "procedure_id int, procedure_bitmap int, cause_type int, cause_value int, "
               + "ue_id_type int, ue_id string, start_time long, end_time long, "
               + "ran_ue_ngap_id string, amf_ue_ngap_id string, switch_off int, date_time string) "
               + "USING iceberg PARTITIONED BY (date_time) "
               + "TBLPROPERTIES ('write.metadata.delete-after-commit.enabled'='true', "
               + "'write.metadata.previous-versions-max'='1')");
   }

   // Read the source data from Kafka
   Dataset<Row> data = session.readStream()
           .format("kafka")
           .option("kafka.bootstrap.servers", "127.0.0.1:9092")
           .option("kafka.group.id", "groupId")
           .option("startingOffsets", "latest")
           .option("failOnDataLoss", "false")
           .option("maxOffsetsPerTrigger", "15000000")
           .option("subscribe", "n1_log")
           .load()
           .selectExpr("CAST(value AS STRING)")
           .toDF("data")
           .map(new N1LogMapFunction(),
                   Encoders.bean(N1Log.class))
           .select("ranIp", "ranPort", "amfIp", "amfPort", "procedureId", "procedureBitmap", "causeType", "causeValue", "ueIdType", "ueId", "startTime", "endTime", "ranUeNgapId", "amfUeNgapId", "switchOff", "dateTime")
           .toDF("ran_ip", "ran_port", "amf_ip", "amf_port", "procedure_id", "procedure_bitmap", "cause_type", "cause_value", "ue_id_type", "ue_id", "start_time", "end_time", "ran_ue_ngap_id", "amf_ue_ngap_id", "switch_off", "date_time");

   // Write the stream to the Iceberg table
   StreamingQuery query = data.writeStream()
           .format("iceberg")
           .outputMode("append")
           .partitionBy("date_time")
           .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
           .option("path", "hdfs://bigdata/bigdata/warehouse/db/n1_log")
           .option("checkpointLocation", "hdfs://bigdata/bigdata/checkpointLocation/n1_log")
           .start();

   query.awaitTermination();
   query.stop();
   ```
   
   I can query data from the Iceberg table, but when I use the small-file merge function I get some errors:
   ```
   SparkSession session = SparkSession
           .builder()
           .appName("spark-iceberg")
           .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
           .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
           .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.local.type", "hadoop")
           .config("spark.sql.catalog.local.warehouse", "hdfs://sugon-bigdata/sugon-bigdata/warehouse")
           .master("local[1]")
           .getOrCreate();

   HadoopCatalog hadoopCatalog = new HadoopCatalog(new Configuration(), "hdfs://sugon-bigdata/sugon-bigdata/warehouse");
   Table table = hadoopCatalog.loadTable(TableIdentifier.of("db", "n1_log"));

   SparkActions.get(session)
           .rewriteDataFiles(table)
           .filter(Expressions.equal("date_time", "2023-03-30"))
           .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024))
           .execute();
   ```
   
   After I run the code, I get this error:
   ```
   23/03/30 10:32:38 WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and subclasses resulted in no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   
   23/03/30 10:32:38 WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
   javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".
   
   NestedThrowablesStackTrace:
   java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
   ```
   Could you give me some help?


[GitHub] [iceberg] kbendick commented on issue #5515: "Table not found" error while using rewriteDataFiles

Posted by GitBox <gi...@apache.org>.

kbendick commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1215361026

You mentioned that you can get the location and the `currentSnapshot`.

I think there isn't anything wrong with your table. If you look at the error message, it says that one of the file set groups failed to be rewritten. This _usually_ happens when there's a concurrent write or other operation on the table that prevents the data rewrite from proceeding (if it would break ACID compliance).

However, the final error message does show that table `hive.wrk.my_table` was not found.

When you use `spark.sql("show tables in hive")` or `spark.sql("show tables in hive.wrk")`, are you able to see the table?

Given that you didn't set the `uri` property as found in the first example here, https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs, I think you need to be sure that the table is registered as `hive.wrk.my_table`.
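
As a rough illustration of the catalog properties on that page, the sketch below builds the `spark.sql.catalog.<name>` entries for a Hive-backed Iceberg catalog. `IcebergCatalogConf`, `hiveCatalogConf`, and the `thrift://metastore-host:9083` address are hypothetical names used only for this example; the property keys themselves follow the linked documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Iceberg or Spark.
final class IcebergCatalogConf {
    private IcebergCatalogConf() {}

    // Builds the Spark conf entries that register an Iceberg catalog backed by a Hive metastore.
    // Pass null for metastoreUri to leave the uri property unset.
    static Map<String, String> hiveCatalogConf(String catalogName, String metastoreUri) {
        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".type", "hive");
        if (metastoreUri != null) {
            conf.put(prefix + ".uri", metastoreUri);
        }
        return conf;
    }

    public static void main(String[] args) {
        // "hive" matches the catalog part of hive.wrk.my_table; the URI is a placeholder.
        hiveCatalogConf("hive", "thrift://metastore-host:9083")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each entry could then be passed to `SparkSession.builder().config(key, value)`, or as `--conf key=value` when launching spark-shell.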

Additionally, I don't see where your code configures the Spark session. `SparkActions.get()` takes no session argument and falls back to the [current active spark session](https://github.com/apache/iceberg/blob/ce5128f09cc697455e76af08ce6ce3c9c5b08b70/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkActions.java#L46-L48), so that session needs to be properly initialized before you call it.

So I'd suggest:
1. Ensure that the program, as written, can see the table when you use `spark.sql("SHOW TABLES IN ....")`.
2. Consider passing the URI directly to the catalog properties when configuring it.
3. Make sure that `wrk` is really the namespace, and that the table isn't _named_ `wrk.my_table`.

If you share how you configured and submitted the program, along with the output of `spark.sql("SHOW TABLES IN ....")` and of `DESCRIBE TABLE EXTENDED ...` on that table, that should help.

I think it's likely that you just need to properly initialize the Spark session behind the `spark` object (i.e. `SparkSession.builder().....getOrCreate()`) so that the SparkActions provider can use it.
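
A quick way to sanity-check the catalog part of an identifier such as `hive.wrk.my_table` is to compare it against the catalogs registered in the Spark configuration. This is a minimal sketch, assuming catalogs are registered through `spark.sql.catalog.<name>` keys; `CatalogNameCheck` and its methods are hypothetical helpers, not Iceberg or Spark APIs, and they only inspect a plain map of settings rather than a live session.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Iceberg or Spark.
final class CatalogNameCheck {
    private static final String PREFIX = "spark.sql.catalog.";

    private CatalogNameCheck() {}

    // Collects catalog names from keys of the form spark.sql.catalog.<name>.
    static Set<String> configuredCatalogs(Map<String, String> sparkConf) {
        Set<String> names = new LinkedHashSet<>();
        for (String key : sparkConf.keySet()) {
            if (key.startsWith(PREFIX)) {
                String rest = key.substring(PREFIX.length());
                // Keys with a further dot (for example "<name>.type") are catalog options.
                if (!rest.isEmpty() && !rest.contains(".")) {
                    names.add(rest);
                }
            }
        }
        return names;
    }

    // True when the first part of a dotted identifier names a configured catalog.
    static boolean catalogIsConfigured(String identifier, Map<String, String> sparkConf) {
        String firstPart = identifier.split("\\.", 2)[0];
        return configuredCatalogs(sparkConf).contains(firstPart);
    }

    public static void main(String[] args) {
        // Example settings; in a real program these would come from the session's configuration.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog");
        conf.put("spark.sql.catalog.hive_prod.type", "hive");
        System.out.println(catalogIsConfigured("hive.wrk.my_table", conf));
        System.out.println(catalogIsConfigured("hive_prod.wrk.my_table", conf));
    }
}
```

If the catalog part is not registered, it is worth checking the session configuration before looking for problems in the table itself.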

[GitHub] [iceberg] Fokko commented on issue #5515: "Table not found" error while using rewriteDataFiles

Posted by GitBox <gi...@apache.org>.

Fokko commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1236891747

   Thanks for letting us know that it works @VasilyMelnik 👍🏻 


[GitHub] [iceberg] Fokko closed issue #5515: "Table not found" error while using rewriteDataFiles

Posted by GitBox <gi...@apache.org>.

Fokko closed issue #5515: "Table not found" error while using rewriteDataFiles
URL: https://github.com/apache/iceberg/issues/5515


[GitHub] [iceberg] kbendick commented on issue #5515: "Table not found" error while using rewriteDataFiles

Posted by GitBox <gi...@apache.org>.

kbendick commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1215362318

   But TLDR:
   
   If the table object is valid, I think that your spark session is not 100% correctly initialized.


[GitHub] [iceberg] VasilyMelnik commented on issue #5515: "Table not found" error while using rewriteDataFiles

Posted by GitBox <gi...@apache.org>.

VasilyMelnik commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1216248194

   Thanks a lot!
   The problem was incorrect catalog properties. I changed them to:
   ```
   spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
   spark.sql.catalog.hive_prod.type = hive
   ```
   and now everything is fine!

