Posted to dev@iceberg.apache.org by Huadong Liu <hu...@gmail.com> on 2021/04/22 05:34:12 UTC

Spark configuration on hive catalog

Hello Iceberg Dev,

I am not sure I follow the discussion on Spark configurations on hive
catalogs <https://iceberg.apache.org/spark-configuration/#catalogs>. I
created an iceberg table with the hive catalog.

Configuration conf = new Configuration();
conf.set("hive.metastore.uris", args[0]);
conf.set("hive.metastore.warehouse.dir", args[1]);

HiveCatalog catalog = new HiveCatalog(conf);
ImmutableMap meta = ImmutableMap.of(...);
Schema schema = new Schema(...);
PartitionSpec spec = PartitionSpec.builderFor(schema)...build();

TableIdentifier name = TableIdentifier.of("my_db", "my_table");
Table table = catalog.createTable(name, schema, spec);
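
For context, the elided schema and spec look roughly like the following
(column names simplified to match the INSERT below):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Simplified columns: a string id, a timestamp, and an int value.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.StringType.get()),
    Types.NestedField.required(2, "ts", Types.TimestampType.withZone()),
    Types.NestedField.optional(3, "value", Types.IntegerType.get()));

// Partition by day on the timestamp column.
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .day("ts")
    .build();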

On a box with hive.metastore.uris set correctly in hive-site.xml, spark-sql
runs fine with

spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1
--conf
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive

spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1),
("333", timestamp 'today', 3);

spark-sql> SELECT * FROM my_db.my_table ;
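
(For what it's worth, the equivalent programmatic setup is roughly the
following; an untested Spark 3.x Java sketch, assuming the
iceberg-spark3-runtime jar is already on the classpath:)

import org.apache.spark.sql.SparkSession;

// Same configuration as the --conf flags above.
SparkSession spark = SparkSession.builder()
    .config("spark.sql.catalog.spark_catalog",
        "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate();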

However, if I follow the Spark hive configuration above to add a table
catalog,

spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive
--conf spark.sql.catalog.my_db=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.my_db.type=hive

spark-sql> INSERT INTO my_db.my_table VALUES ("111", timestamp 'today', 1),
("333", timestamp 'today', 3);

Error in query: Table not found: my_db.my_table;

https://iceberg.apache.org/spark/#reading-an-iceberg-table states that "To
use Iceberg in Spark, first configure Spark catalogs." Did I misunderstand
anything? Do I have to configure catalog/namespace? Thanks for your time on
this.

--
Huadong

Re: Spark configuration on hive catalog

Posted by Russell Spitzer <ru...@gmail.com>.
One thing to double-check is that you have set up your Spark client to use a Hive catalog for the session catalog. It is possible you are using a Derby-based session catalog, which the Iceberg catalog is wrapping. See

https://github.com/apache/iceberg/issues/2488

Make sure that

spark.sql.catalogImplementation = hive
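
One way to verify from the spark-sql prompt (the exact output format may vary):

spark-sql> SET spark.sql.catalogImplementation;
spark.sql.catalogImplementation	hive

If that comes back as in-memory, or if no hive-site.xml is on the classpath,
you may be talking to a local Derby/in-memory catalog rather than your Hive
metastore.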




Re: Spark configuration on hive catalog

Posted by Huadong Liu <hu...@gmail.com>.
Thank you Szehon, Russell. Yeah, glad to see you here, Szehon!

My bad! I was confused by the .db suffix when a catalog namespace is
created (createDatabase internally). Things work as expected when I use
<catalog_name>.<db_name_without_db_suffix>.<table_name>.
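
For example, registering the catalog under a separate name (hypothetical
name iceberg_cat) and using the full three-part identifier:

spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1
--conf spark.sql.catalog.iceberg_cat=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.iceberg_cat.type=hive

spark-sql> INSERT INTO iceberg_cat.my_db.my_table VALUES ("111", timestamp 'today', 1);
spark-sql> SELECT * FROM iceberg_cat.my_db.my_table;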

--
Huadong


Re: Spark configuration on hive catalog

Posted by Szehon Ho <sz...@apple.com.INVALID>.
Hi Huadong, nice to see you again :).  In spark-sql the syntax is ‘INSERT INTO <catalog>.<db>.<table> …’; here you defined your db as a catalog?

You just need to define one catalog and use it when referring to your table.


