Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/13 11:43:55 UTC
[GitHub] [iceberg] VasilyMelnik opened a new issue, #5515: "Table not found" error while using rewriteDataFiles
VasilyMelnik opened a new issue, #5515:
URL: https://github.com/apache/iceberg/issues/5515
### Query engine
Spark, Java API
### Question
Hi all.
I created an Iceberg table from spark-shell like this:
```
spark.sql(""" create table wrk.my_table
using iceberg
as select ...""")
```
I can see my_table in the Hive metastore, and all SQL queries via spark-shell work fine.
But when I try to use the Java API:
```
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.Table
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.iceberg.expressions.Expressions
val catalog = new HiveCatalog();
catalog.setConf(spark.sparkContext.hadoopConfiguration);
val properties = new java.util.HashMap[String,String];
catalog.initialize("hive", properties);
val name = TableIdentifier.of("wrk", "my_table");
val table = catalog.loadTable(name)
SparkActions.get().rewriteDataFiles(table).filter(Expressions.alwaysTrue()).option("target-file-size-bytes", (500 * 1024 * 1024).toString).option("min-input-files","2").execute()
```
I get an error:
> ERROR actions.BaseRewriteDataFilesSparkAction: Cannot complete rewrite, partial-progress.enabled is not enabled and one of the file set groups failed to be rewritten. This error occurred during the writing of new files, not during the commit process. This indicates something is wrong that doesn't involve conflicts with other Iceberg operations. Enabling partial-progress.enabled may help in this case but the root cause should be investigated. Cleaning up 0 groups which finished being written.
> java.lang.RuntimeException: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table hive.wrk.my_table not found
The Table object is valid: I can get the location and currentSnapshot from it.
What is the problem?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] xiongliang1989 commented on issue #5515: "Table not found" error while using rewriteDataFiles
Posted by "xiongliang1989 (via GitHub)" <gi...@apache.org>.
xiongliang1989 commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1489598133
### Query engine
Spark 3.1.1, Iceberg 1.1.0
### Question
Hi all.
I created an Iceberg table using a Hadoop catalog from Java code like this:
```
// get sparkSession;
SparkSession session = SparkSession
    .builder()
    .appName("spark-iceberg")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "hdfs://bigdata/bigdata/warehouse")
    .master("local[1]")
    .getOrCreate();
Configuration hadoopConf = new Configuration();
HadoopCatalog catalog = new HadoopCatalog(hadoopConf, "hdfs://bigdata/bigdata/warehouse");
TableIdentifier name = TableIdentifier.of("db", "n1_log");
// create table
if (!catalog.tableExists(name)) {
    session.sql("CREATE table local.db.n1_log (ran_ip string, ran_port int, amf_ip string, amf_port int, procedure_id int, procedure_bitmap int,cause_type int, cause_value int, ue_id_type int, ue_id string, start_time long,end_time long, ran_ue_ngap_id string, amf_ue_ngap_id string, switch_off int, date_time string) USING iceberg PARTITIONED BY (date_time) TBLPROPERTIES ('write.metadata.delete-after-commit.enabled'='true', 'write.metadata.previous-versions-max'='1')");
}
// get Data source
Dataset<Row> data = session.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "127.0.0.1:9092")
    .option("kafka.group.id", "groupId")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .option("maxOffsetsPerTrigger", "15000000")
    .option("subscribe", "n1_log")
    .load()
    .selectExpr("CAST(value AS STRING)")
    .toDF("data")
    .map(new N1LogMapFunction(),
        Encoders.bean(N1Log.class))
    .select("ranIp", "ranPort", "amfIp", "amfPort", "procedureId", "procedureBitmap", "causeType", "causeValue", "ueIdType", "ueId", "startTime", "endTime", "ranUeNgapId", "amfUeNgapId", "switchOff", "dateTime")
    .toDF("ran_ip", "ran_port", "amf_ip", "amf_port", "procedure_id", "procedure_bitmap", "cause_type", "cause_value", "ue_id_type", "ue_id", "start_time", "end_time", "ran_ue_ngap_id", "amf_ue_ngap_id", "switch_off", "date_time");
// write
StreamingQuery query = data.writeStream()
    .format("iceberg")
    .outputMode("append")
    .partitionBy("date_time")
    .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
    .option("path", "hdfs://bigdata/bigdata/warehouse/db/n1_log")
    .option("checkpointLocation", "hdfs://bigdata/bigdata/checkpointLocation/n1_log")
    .start();
query.awaitTermination();
query.stop();
```
I can query data from the Iceberg table, but when I use the small-file merge (compaction) function, I get an error:
```
SparkSession session = SparkSession
    .builder()
    .appName("spark-iceberg")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "hdfs://sugon-bigdata/sugon-bigdata/warehouse")
    .master("local[1]")
    .getOrCreate();
HadoopCatalog hadoopCatalog = new HadoopCatalog(new Configuration(), "hdfs://sugon-bigdata/sugon-bigdata/warehouse");
Table table = hadoopCatalog.loadTable(TableIdentifier.of("db", "n1_log"));
SparkActions.get(session)
    .rewriteDataFiles(table)
    .filter(Expressions.equal("date_time", "2023-03-30"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024))
    .execute();
```
When I run the code, I get this error:
```
23/03/30 10:32:38 WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
23/03/30 10:32:38 WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".
NestedThrowablesStackTrace:
java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
```
Could you give me some help?
[GitHub] [iceberg] kbendick commented on issue #5515: "Table not found" error while using rewriteDataFiles
Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1215361026
You mentioned that you can get the location and the `currentSnapshot`.
I think there isn't anything wrong with your table. If you look at the error message, it says that one of the file set groups failed to be rewritten. This _usually_ happens when there's a concurrent write or other operation on the table that prevents the data rewrite from proceeding (if it would break ACID compliance).
However, the final error message does say that table `hive.wrk.my_table` was not found.
When you use `spark.sql("show tables in hive")` or `spark.sql("show tables in hive.wrk")`, are you able to see the table?
Given that you didn't set the `uri` property as found in the first example here, https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs, I think you need to be sure that the table is registered as `hive.wrk.my_table`.
Additionally, I don't see where in the code you configured the Spark session. `SparkActions.get()`, called without arguments, uses the [current active Spark session](https://github.com/apache/iceberg/blob/ce5128f09cc697455e76af08ce6ce3c9c5b08b70/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkActions.java#L46-L48), so that session needs to be active and properly initialized.
So I'd suggest:
1. Ensure that the program, as written, can see the table when you use `spark.sql("SHOW TABLES IN ....")`.
2. Consider passing the URI directly to the catalog properties when configuring it.
3. Make sure that `wrk` is really the namespace, and that the table isn't _named_ `wrk.my_table`.
If you share how you configured and submitted the program, check the output of `spark.sql("SHOW TABLES IN ....")`, and then run `DESCRIBE TABLE EXTENDED ...` on that table, that should help.
I think it's likely that you just need to properly initialize the Spark session behind the `spark` object (in order to use the SparkActions provider), i.e. via `SparkSession.builder().....getOrCreate()`.
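For suggestion 2, a minimal Java sketch of building the properties map that would be handed to `catalog.initialize("hive", properties)`. It assumes the metastore is reached over Thrift; the class name `HiveCatalogProps`, the helper `withMetastoreUri`, and the host `metastore-host:9083` are placeholders for illustration, not values from this issue. The `uri` key is the one used in the Iceberg catalog configuration docs linked above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds the properties map for HiveCatalog.initialize(name, properties).
class HiveCatalogProps {

    // "uri" is the catalog property that points at the Hive metastore.
    static Map<String, String> withMetastoreUri(String thriftUri) {
        // Sanity check (an assumption of this sketch): metastore URIs use the thrift:// scheme.
        if (thriftUri == null || !thriftUri.startsWith("thrift://")) {
            throw new IllegalArgumentException("Expected a thrift:// metastore URI, got: " + thriftUri);
        }
        Map<String, String> props = new HashMap<>();
        props.put("uri", thriftUri);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder address; replace with the real metastore host and port.
        Map<String, String> props = withMetastoreUri("thrift://metastore-host:9083");
        System.out.println(props);
        // With Iceberg on the classpath, this map replaces the empty HashMap in the question:
        //   catalog.initialize("hive", props);
    }
}
```

Note that this only affects how the Java-side `HiveCatalog` finds the metastore; the Spark session used by `SparkActions` still needs its own catalog configuration.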
[GitHub] [iceberg] Fokko commented on issue #5515: "Table not found" error while using rewriteDataFiles
Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1236891747
Thanks for letting us know that it works @VasilyMelnik 👍🏻
[GitHub] [iceberg] Fokko closed issue #5515: "Table not found" error while using rewriteDataFiles
Posted by GitBox <gi...@apache.org>.
Fokko closed issue #5515: "Table not found" error while using rewriteDataFiles
URL: https://github.com/apache/iceberg/issues/5515
[GitHub] [iceberg] kbendick commented on issue #5515: "Table not found" error while using rewriteDataFiles
Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1215362318
But TLDR:
If the table object is valid, I think that your Spark session is not correctly initialized.
[GitHub] [iceberg] VasilyMelnik commented on issue #5515: "Table not found" error while using rewriteDataFiles
Posted by GitBox <gi...@apache.org>.
VasilyMelnik commented on issue #5515:
URL: https://github.com/apache/iceberg/issues/5515#issuecomment-1216248194
Thanks a lot!
The problem was incorrect catalog properties. I changed them to
```
spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
```
and now everything is fine!
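For anyone setting this up from Java rather than a properties file, here is a rough sketch that generates the two settings above for a named catalog. The class `SparkCatalogConf` and the method `hiveCatalogConf` are hypothetical names for illustration, and the optional `.uri` entry is an assumption based on the Iceberg Spark configuration docs, not something reported in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: produces Spark settings that register an Iceberg catalog backed by Hive.
class SparkCatalogConf {

    static Map<String, String> hiveCatalogConf(String catalogName, String metastoreUri) {
        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> conf = new LinkedHashMap<>(); // keeps insertion order for printing
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".type", "hive");
        if (metastoreUri != null) {
            // Optional (assumption): set only when the metastore address is not picked up from hive-site.xml.
            conf.put(prefix + ".uri", metastoreUri);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Prints the same two settings as in the comment above.
        hiveCatalogConf("hive_prod", null)
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each entry can be applied with `SparkSession.builder().config(key, value)` before `getOrCreate()`, in the same way the session is configured in the Hadoop-catalog snippet earlier in this thread.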