Posted to dev@iceberg.apache.org by Huadong Liu <hu...@gmail.com> on 2021/04/22 05:34:12 UTC
Spark configuration on hive catalog
Hello Iceberg Dev,
I am not sure I follow the discussion on Spark configurations on hive
catalogs <https://iceberg.apache.org/spark-configuration/#catalogs>. I
created an iceberg table with the hive catalog.
// Imports assumed for this snippet (Hadoop conf, Guava, Iceberg 0.11 API):
import org.apache.hadoop.conf.Configuration;
import com.google.common.collect.ImmutableMap;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
Configuration conf = new Configuration();
conf.set("hive.metastore.uris", args[0]);
conf.set("hive.metastore.warehouse.dir", args[1]);
HiveCatalog catalog = new HiveCatalog(conf);
ImmutableMap meta = ImmutableMap.of(...);
Schema schema = new Schema(...);
PartitionSpec spec = PartitionSpec.builderFor(schema)...build();
TableIdentifier name = TableIdentifier.of("my_db", "my_table");
Table table = catalog.createTable(name, schema, spec);
On a box with hive.metastore.uris set correctly in hive-site.xml, spark-sql
runs fine with
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1),
("333", timestamp 'today', 3);
spark-sql> SELECT * FROM my_db.my_table;
However, if I follow the Spark hive configuration above to add a table
catalog,
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.my_db=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_db.type=hive
spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1),
("333", timestamp 'today', 3);
Error in query: Table not found: my_db.my_table;
https://iceberg.apache.org/spark/#reading-an-iceberg-table states that "To
use Iceberg in Spark, first configure Spark catalogs." Did I misunderstand
anything? Do I have to configure catalog/namespace? Thanks for your time on
this.
--
Huadong
Re: Spark configuration on hive catalog
Posted by Russell Spitzer <ru...@gmail.com>.
One thing to double check is that you have set up your Spark client to use a Hive catalog for the session catalog. It is possible you are using a Derby-based session catalog, which the Iceberg catalog is wrapping. See
https://github.com/apache/iceberg/issues/2488
Make sure that
spark.sql.catalogImplementation = hive
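To illustrate the point, here is a minimal sketch of a launch that pins the session catalog to Hive; the --packages coordinate is the one from the original message, and the rest is standard Spark/Iceberg configuration:

```shell
# Back the built-in session catalog with the Hive metastore instead of the
# default in-memory/Derby catalog, and let Iceberg wrap it.
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive

# Inside the shell, `SET spark.sql.catalogImplementation;` should report "hive".
```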
> On Apr 22, 2021, at 12:30 PM, Szehon Ho <sz...@apple.com.INVALID> wrote:
>
> Hi Huadong, nice to see you again :). The syntax in spark-sql is ‘INSERT INTO <catalog>.<db>.<table> …’; here you defined your db as a catalog?
>
> You just need to define one catalog and use it when referring to your table.
>
Re: Spark configuration on hive catalog
Posted by Huadong Liu <hu...@gmail.com>.
Thank you Szehon, Russell. Yeah, glad to see you here, Szehon!
My bad! I was confused by the .db suffix that appears when a catalog namespace
is created (createDatabase internally). Things work as expected when I use
<catalog_name>.<db_name_without_db_suffix>.<table_name>.
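For the archives, a sketch of the setup that works, with an arbitrarily named catalog (`iceberg` here) so it does not collide with the database name:

```shell
# Define one Iceberg catalog backed by the Hive metastore...
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hive

# ...then qualify the table as <catalog>.<db>.<table>, without any ".db" suffix:
#   INSERT INTO iceberg.my_db.my_table VALUES ('111', timestamp 'today', 1);
#   SELECT * FROM iceberg.my_db.my_table;
```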
--
Huadong
On Thu, Apr 22, 2021 at 10:31 AM Szehon Ho <sz...@apple.com.invalid>
wrote:
> Hi Huadong, nice to see you again :). The syntax in spark-sql is ‘INSERT
> INTO <catalog>.<db>.<table> …’; here you defined your db as a catalog?
>
> You just need to define one catalog and use it when referring to your
> table.
Re: Spark configuration on hive catalog
Posted by Szehon Ho <sz...@apple.com.INVALID>.
Hi Huadong, nice to see you again :). The syntax in spark-sql is ‘INSERT INTO <catalog>.<db>.<table> …’; here you defined your db as a catalog?
You just need to define one catalog and use it when referring to your table.
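Concretely, with the configuration from the original message, which registers an Iceberg catalog named my_db, the table would have to be addressed with the catalog as the first part of the identifier:

```shell
# "my_db" is a catalog here, so the insert needs <catalog>.<db>.<table>:
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
  --conf spark.sql.catalog.my_db=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_db.type=hive \
  -e "INSERT INTO my_db.my_db.my_table VALUES ('111', timestamp 'today', 1)"
```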