Posted to dev@iceberg.apache.org by Lian Jiang <ji...@gmail.com> on 2021/08/09 05:36:46 UTC

Writing iceberg table to S3

I am reading https://iceberg.apache.org/spark-writes/#spark-writes and
wondering if it is possible to create an Iceberg table on S3. The guide
seems to cover only writing to a Hive table (backed by HDFS, if I understand
correctly). Hudi and Delta can write to S3 with a specified S3 path. How
can I do the same with Iceberg? Thanks for any clarification.

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
`SET iceberg.mr.catalog=hive` works!!! Thanks Ryan, you rock!!! You may want
to consider adding the following to the Iceberg documentation to help other newcomers:

Add `SET iceberg.mr.catalog=hive` to https://iceberg.apache.org/hive/.
Add `.tableProperty("location", filePath)` to
https://iceberg.apache.org/spark-writes/.
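For anyone landing on this thread later, here is a minimal sketch of the path the
thread converges on, adapted back to the original S3 question. The table name and
s3a:// path are illustrative, and it assumes a Spark catalog configured as shown in
the quoted messages below plus working S3 access (credentials and S3A or Iceberg's
S3FileIO, per https://iceberg.apache.org/aws/):

val table = "hive_test.mydb.mytable_s3"              // hypothetical table name
val s3Path = "s3a://my-bucket/warehouse/mytable_s3"  // hypothetical S3 location
df.writeTo(table)
    .tableProperty("write.format.default", "parquet")
    .tableProperty("location", s3Path)               // table property, not a write option
    .createOrReplace()

On the Hive side, the table becomes queryable once `SET iceberg.mr.catalog=hive` is
run in the Hive session, as noted above.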



On Wed, Aug 11, 2021 at 3:56 PM Ryan Blue <bl...@tabular.io> wrote:

> Looks like the table is set up correctly. I think the problem might be how
> Hive is configured. I think by default it will try to load tables by
> location in 0.11.1. You need to tell it to load tables as metastore tables,
> not HDFS tables by running `SET iceberg.mr.catalog=hive`.
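A minimal Hive session sketch of this fix, reusing the runtime jar path and table
name from the quoted messages below, and assuming iceberg.engine.hive.enabled=true
and hive.vectorized.execution.enabled=false are already set in hive-site.xml as
described there:

hive> add jar /path/to/iceberg-hive-runtime.jar;
hive> SET iceberg.mr.catalog=hive;
hive> select * from mydb.mytable3;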
>
> On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> hive> describe formatted mytable3;
>> OK
>> # col_name              data_type               comment
>> value                   int
>>
>> # Detailed Table Information
>> Database:               mydb
>> OwnerType:              USER
>> Owner:                  root
>> CreateTime:             Wed Aug 11 20:02:14 UTC 2021
>> LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
>> Retention:              2147483647
>> Location:               hdfs://namenode:8020/tmp/test3.ice
>> Table Type:             EXTERNAL_TABLE
>> Table Parameters:
>>         EXTERNAL                TRUE
>>         metadata_location
>> hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
>>         numFiles                8
>>         storage_handler
>> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
>>         table_type              ICEBERG
>>         totalSize               12577
>>         transient_lastDdlTime   1628712134
>>
>> # Storage Information
>> SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
>> InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
>>
>> OutputFormat:
>> org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
>> Compressed:             No
>> Num Buckets:            0
>> Bucket Columns:         []
>> Sort Columns:           []
>> Time taken: 0.319 seconds, Fetched: 29 row(s)
>>
>> hive> select * from mytable3;
>> FAILED: SemanticException Table does not exist at location:
>> hdfs://namenode:8020/tmp/test3.ice
>>
>> Thanks!
>>
>>
>>
>> On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there
>>> is a storage handler set up for it.
>>>
>>> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Thanks guys. tableProperty("location", ...) works.
>>>>
>>>> I am having trouble getting Hive to query an Iceberg table by following
>>>> https://iceberg.apache.org/hive/.
>>>>
>>>> I have done the following:
>>>> * in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
>>>> * in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
>>>> The same hive-site.xml is used by both the Hive server and Spark.
>>>>
>>>>
>>>> This is my code:
>>>> val table = "hive_test.mydb.mytable3"
>>>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>>>> df.writeTo(table)
>>>>     .tableProperty("write.format.default", "parquet")
>>>>     .tableProperty("location", filePath)
>>>>     .createOrReplace()
>>>>
>>>>
>>>> The iceberg file is created in the specified location. It can be
>>>> queried in spark sql.
>>>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>>>> Found 2 items
>>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>>>> /tmp/test3.ice/data
>>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>>>> /tmp/test3.ice/metadata
>>>>
>>>> This hive table is created but cannot be queried:
>>>> hive> select * from mytable3;
>>>> FAILED: SemanticException Table does not exist at location:
>>>> hdfs://namenode:8020/tmp/test3.ice
>>>>
>>>> I am using spark 3.1.1 and hive 3.1.2. What else am I missing? I am
>>>> very close to having a happy path for migrating parquet to iceberg. Thanks.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> The problem for #3 is how Spark handles the options. The option
>>>>> method sets write options, not table properties. The write options aren’t
>>>>> passed when creating the table. Instead, you should use tableProperty("location",
>>>>> ...).
>>>>>
>>>>> Ryan
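A compact Scala contrast of the two calls, reusing the df, table, and filePath
names from the code quoted above and below:

// sets a write option, which is not passed along when the table is created,
// so the custom location is dropped at create time
df.writeTo(table)
    .option("location", filePath)
    .createOrReplace()

// sets a table property, so the location is stored on the table when it is created
df.writeTo(table)
    .tableProperty("location", filePath)
    .createOrReplace()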
>>>>>
>>>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <
>>>>> russell.spitzer@gmail.com> wrote:
>>>>>
>>>>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>>>>> integration from Iceberg, so you shouldn't see the data in Hive unless you
>>>>>> have configured that; see https://iceberg.apache.org/hive/.
>>>>>>
>>>>>> 3)
>>>>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>>>> I would check what properties are set on the table to see why that
>>>>>> wasn't set, but "location" is the correct way of setting the table location.
>>>>>> The property may also be getting ignored by Spark: I'm assuming you are
>>>>>> using the latest Spark build you can, because there is a bug in Spark 3.0
>>>>>> that sometimes ignores options passed to the V2 API
>>>>>> (https://issues.apache.org/jira/browse/SPARK-32592), which is fixed in 3.1.
>>>>>>
>>>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Any help is highly appreciated!
>>>>>>
>>>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Russell.
>>>>>>>
>>>>>>> I tried:
>>>>>>>
>>>>>>> /spark/bin/spark-shell --packages
>>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>>>>
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>
>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>> import spark.implicits._
>>>>>>> val df = values.toDF()
>>>>>>>
>>>>>>> val table = "hive_test.mydb.mytable3"
>>>>>>> df.writeTo(table)
>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>     .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>> spark.table(table).show()
>>>>>>>
>>>>>>> Observations:
>>>>>>> 1. spark.table(table).show() does show the table correctly.
>>>>>>> +-----+
>>>>>>> |value|
>>>>>>> +-----+
>>>>>>> |    1|
>>>>>>> |    2|
>>>>>>> |    3|
>>>>>>> |    4|
>>>>>>> |    5|
>>>>>>> +-----+
>>>>>>>
>>>>>>> 2. mydb.mytable3 is created in HIVE but it is empty:
>>>>>>> hive> select * from mytable3;
>>>>>>> OK
>>>>>>> Time taken: 0.158 seconds
>>>>>>>
>>>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>>>
>>>>>>> Any idea about 2 and 3? Thanks very much.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>>> Specify a property of "location" when creating the table. Just add
>>>>>>>> a ".option("location", "path")"
>>>>>>>>
>>>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks Russell. This helps a lot.
>>>>>>>>
>>>>>>>> I want to specify an HDFS location when creating an Iceberg dataset
>>>>>>>> using the dataframe API. All the examples that use a warehouse location are
>>>>>>>> SQL. Do you have an example for the dataframe API? For example, how would I
>>>>>>>> specify an HDFS/S3 location in the query below? The reason I ask is that my
>>>>>>>> current code all uses the Spark API, and it will be much easier if I can
>>>>>>>> keep using it when migrating parquet to iceberg. Hope it makes sense.
>>>>>>>>
>>>>>>>> data.writeTo("prod.db.table")
>>>>>>>>     .tableProperty("write.format.default", "orc")
>>>>>>>>     .partitionBy($"level", days($"ts"))
>>>>>>>>     .createOrReplace()
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>>>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Ryan.
>>>>>>>>>>
>>>>>>>>>> Using this command (uri is omitted because the uri is in
>>>>>>>>>> hive-site.xml):
>>>>>>>>>> spark-shell --conf
>>>>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>>>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>>>>>>
>>>>>>>>>> This statement:
>>>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING
>>>>>>>>>> iceberg")
>>>>>>>>>>
>>>>>>>>>> caused warning:
>>>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe
>>>>>>>>>> for data source provider iceberg.
>>>>>>>>>>
>>>>>>>>>> I tried:
>>>>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>>>>>>> https://github.com/apache/iceberg/issues/2260
>>>>>>>>>> * use --packages
>>>>>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>>>
>>>>>>>>>> but they did not help. This warning blocks inserting any data
>>>>>>>>>> into this table. Any ideas are appreciated!
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian,
>>>>>>>>>>>
>>>>>>>>>>> I think we should improve the docs for catalogs since it isn’t
>>>>>>>>>>> clear. We have a few configuration pages that are helpful, but it looks
>>>>>>>>>>> like they assume you know what your options are already. Take a look at the
>>>>>>>>>>> Spark docs for catalogs, which is the closest we have right now:
>>>>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>>>
>>>>>>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>>>>>>> example:
>>>>>>>>>>>
>>>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>>>>
>>>>>>>>>>> For MERGE INTO, a DataFrame API is not present in Spark,
>>>>>>>>>>> which is why it can only be used through SQL. A DataFrame version is
>>>>>>>>>>> something that should probably be added to Spark rather than Iceberg, since
>>>>>>>>>>> it would just be a different way to build the same underlying Spark plan.
>>>>>>>>>>>
>>>>>>>>>>> To your question about dataframes vs SQL, I highly recommend SQL
>>>>>>>>>>> over DataFrames so that you don’t end up needing to use Jars produced by
>>>>>>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>>>>>>> should support both because DataFrames are useful for customization in some
>>>>>>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>>>>>>> Iceberg is concerned.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>>>
>>>>>>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then
>>>>>>>>>>>> add an external table in hive metastore for this parquet. In the future,
>>>>>>>>>>>> when using iceberg, I prefer hive metadata store since it is my
>>>>>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for
>>>>>>>>>>>> that?
>>>>>>>>>>>>
>>>>>>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Lian,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only restriction for S3 is that you need a metastore to
>>>>>>>>>>>>> track the table metadata location because S3 doesn't have a way to
>>>>>>>>>>>>> implement a metadata commit. For a metastore, there are implementations
>>>>>>>>>>>>> backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming
>>>>>>>>>>>>> release adds support for DynamoDB without Glue and JDBC.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
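A minimal sketch of the two approaches Ryan describes, with a hypothetical bucket
and the hive_prod catalog from his earlier reply; both assume the cluster already
has credentials and a FileIO/S3A setup that can reach S3:

// explicit table location via the LOCATION clause (bucket name is illustrative)
spark.sql("CREATE TABLE hive_prod.db.sample (id bigint) USING iceberg " +
  "LOCATION 's3://my-bucket/db/sample'")

Alternatively, adding spark.sql.catalog.hive_prod.warehouse = s3://my-bucket/warehouse
to the catalog configuration shown above makes new tables default to locations under
that S3 path.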
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/.
>>>>>>>>>>>>>> It should contain all the info that you need. The codebase contains an
>>>>>>>>>>>>>> S3FileIO class, which is an implementation that is backed by S3.
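A minimal sketch of wiring S3FileIO into the catalog configuration shown earlier in
the thread; the bucket is hypothetical, and it assumes the iceberg-aws module and an
AWS SDK bundle are on the classpath as described on that page:

spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
spark.sql.catalog.hive_prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.hive_prod.warehouse = s3://my-bucket/warehouse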
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <
>>>>>>>>>>>>>> jiangok2006@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am reading
>>>>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: Writing iceberg table to S3

Posted by Ryan Blue <bl...@tabular.io>.
Looks like the table is set up correctly. I think the problem might be how
Hive is configured. I think by default it will try to load tables by
location in 0.11.1. You need to tell it to load tables as metastore tables,
not HDFS tables by running `SET iceberg.mr.catalog=hive`.

On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <ji...@gmail.com> wrote:

> hive> describe formatted mytable3;
> OK
> # col_name              data_type               comment
> value                   int
>
> # Detailed Table Information
> Database:               mydb
> OwnerType:              USER
> Owner:                  root
> CreateTime:             Wed Aug 11 20:02:14 UTC 2021
> LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
> Retention:              2147483647
> Location:               hdfs://namenode:8020/tmp/test3.ice
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>         EXTERNAL                TRUE
>         metadata_location
> hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
>         numFiles                8
>         storage_handler
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
>         table_type              ICEBERG
>         totalSize               12577
>         transient_lastDdlTime   1628712134
>
> # Storage Information
> SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
> InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
>
> OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
>
> Compressed:             No
> Num Buckets:            0
> Bucket Columns:         []
> Sort Columns:           []
> Time taken: 0.319 seconds, Fetched: 29 row(s)
>
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location:
> hdfs://namenode:8020/tmp/test3.ice
>
> Thanks!
>
>
>
> On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there
>> is a storage handler set up for it.
>>
>> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Thanks guys. tableProperty("location", ...) works.
>>>
>>> I have trouble making hive query an iceberg table by following
>>> https://iceberg.apache.org/hive/.
>>>
>>> I have done:
>>> *  in Hive shell, do `add jar /path/to/iceberg-hive-runtime.jar;`
>>> * in hive-site.xml, add hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
>>> The same hive-site.xml is used by both hive server and spark.
>>>
>>>
>>> This is my code:
>>> val table = "hive_test.mydb.mytable3"
>>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>>> df.writeTo(table)
>>>     .tableProperty("write.format.default", "parquet")
>>>     .tableProperty("location", filePath)
>>>     .createOrReplace()
>>>
>>>
>>> The iceberg file is created in the specified location. It can be queried
>>> in spark sql.
>>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>>> Found 2 items
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>>> /tmp/test3.ice/data
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>>> /tmp/test3.ice/metadata
>>>
>>> This hive table is created but cannot be queried:
>>> hive> select * from mytable3;
>>> FAILED: SemanticException Table does not exist at location:
>>> hdfs://namenode:8020/tmp/test3.ice
>>>
>>> I am using spark 3.1.1 and hive 3.1.2. What else am I missing? I am very
>>> close to having a happy path for migrating parquet to iceberg. Thanks.
>>>
>>>
>>>
>>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> The problem for #3 is how Spark handles the options. The option method
>>>> sets write options, not table properties. The write options aren’t passed
>>>> when creating the table. Instead, you should use tableProperty("location",
>>>> ...).
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <
>>>> russell.spitzer@gmail.com> wrote:
>>>>
>>>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>>>> integration from iceberg. So you shouldn't see it in hive unless you have
>>>>> configured that, see https://iceberg.apache.org/hive/.
>>>>>
>>>>> 3)
>>>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>>> I would check what properties are set in the table to see why that
>>>>> wasn't set. But "location" would be the correct way of setting the table.
>>>>> Unless the property is being ignored by Spark, I'm assuming you are using
>>>>> the latest build possible of Spark. There is a bug in 3.0 of Spark which
>>>>> ignores options passed to the V2 api sometimes,
>>>>> https://issues.apache.org/jira/browse/SPARK-32592 . Which is fixed in
>>>>> 3.1
>>>>>
>>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Any help is highly appreciated!
>>>>>
>>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Russell.
>>>>>>
>>>>>> I tried:
>>>>>>
>>>>>> /spark/bin/spark-shell --packages
>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>>>
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>> val values = List(1,2,3,4,5)
>>>>>>
>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>> import spark.implicits._
>>>>>> val df = values.toDF()
>>>>>>
>>>>>> val table = "hive_test.mydb.mytable3"
>>>>>> df.writeTo(table)
>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>>>>>>     .createOrReplace()
>>>>>>
>>>>>> spark.table(table).show()
>>>>>>
>>>>>> *Observations*:
>>>>>> 1. spark.table(table).show() does show the table correctly.
>>>>>> +-----+
>>>>>> |value|
>>>>>> +-----+
>>>>>> |    1|
>>>>>> |    2|
>>>>>> |    3|
>>>>>> |    4|
>>>>>> |    5|
>>>>>> +-----+
>>>>>>
>>>>>> 2. mydb.mytable3 is created in HIVE but it is empty:
>>>>>> hive> select * from mytable3;
>>>>>> OK
>>>>>> Time taken: 0.158 seconds
>>>>>>
>>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>>
>>>>>> Any idea about 2 and 3? Thanks very much.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>
>>>>>>> Specify a property of "location" when creating the table. Just add a
>>>>>>> ".option("location", "path")"
>>>>>>>
>>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks Russell. This helps a lot.
>>>>>>>
>>>>>>> I want to specify a HDFS location when creating an iceberg dataset
>>>>>>> using dataframe api. All examples using warehouse location are SQL. Do you
>>>>>>> have an example for dataframe API? For example, how to support HDFS/S3
>>>>>>> location in the query below? The reason I ask is that my current code all
>>>>>>> uses spark API. It will be much easier if I can use spark API when
>>>>>>> migrating parquet to iceberg. Hope it makes sense.
>>>>>>>
>>>>>>> data.writeTo("prod.db.table")
>>>>>>>     .tableProperty("write.format.default", "orc")
>>>>>>>     .partitionBy($"level", days($"ts"))
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan.
>>>>>>>>>
>>>>>>>>> Using this command (uri is omitted because the uri is in
>>>>>>>>> hive-site.xml):
>>>>>>>>> spark-shell --conf
>>>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>>>>>
>>>>>>>>> This statement:
>>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING
>>>>>>>>> iceberg")
>>>>>>>>>
>>>>>>>>> caused warning:
>>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe
>>>>>>>>> for data source provider iceberg.
>>>>>>>>>
>>>>>>>>> I tried:
>>>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>>>>>> https://github.com/apache/iceberg/issues/2260
>>>>>>>>> * use --packages
>>>>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>>
>>>>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>>>>> this table. Any ideas are appreciated!
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> I think we should improve the docs for catalogs since it isn’t
>>>>>>>>>> clear. We have a few configuration pages that are helpful, but it looks
>>>>>>>>>> like they assume you know what your options are already. Take a look at the
>>>>>>>>>> Spark docs for catalogs, which is the closest we have right now:
>>>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>>
>>>>>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>>>>>> example:
>>>>>>>>>>
>>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>>>
>>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which
>>>>>>>>>> is why it can’t be used by SQL. This is something that should probably be
>>>>>>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>>>>>>> the same underlying Spark plan.
>>>>>>>>>>
>>>>>>>>>> To your question about dataframes vs SQL, I highly recommend SQL
>>>>>>>>>> over DataFrames so that you don’t end up needing to use Jars produced by
>>>>>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>>>>>> should support both because DataFrames are useful for customization in some
>>>>>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>>>>>> Iceberg is concerned.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>>
>>>>>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add
>>>>>>>>>>> an external table in hive metastore for this parquet. In the future, when
>>>>>>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>>>>
>>>>>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian,
>>>>>>>>>>>>
>>>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>>>>
>>>>>>>>>>>> The only restriction for S3 is that you need a metastore to
>>>>>>>>>>>> track the table metadata location because S3 doesn't have a way to
>>>>>>>>>>>> implement a metadata commit. For a metastore, there are implementations
>>>>>>>>>>>> backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming
>>>>>>>>>>>> release adds support for DynamoDB without Glue and JDBC.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/.
>>>>>>>>>>>>> It should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <
>>>>>>>>>>>>> jiangok2006@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am reading
>>>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Create your own email signature
>>>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Create your own email signature
>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Create your own email signature
>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Create your own email signature
>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>


-- 
Ryan Blue
Tabular

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
hive> describe formatted mytable3;
OK
# col_name              data_type               comment
value                   int

# Detailed Table Information
Database:               mydb
OwnerType:              USER
Owner:                  root
CreateTime:             Wed Aug 11 20:02:14 UTC 2021
LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
Retention:              2147483647
Location:               hdfs://namenode:8020/tmp/test3.ice
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        metadata_location
hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
        numFiles                8
        storage_handler
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
        table_type              ICEBERG
        totalSize               12577
        transient_lastDdlTime   1628712134

# Storage Information
SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat

OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat

Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.319 seconds, Fetched: 29 row(s)

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location:
hdfs://namenode:8020/tmp/test3.ice

Thanks!



On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <bl...@tabular.io> wrote:

> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there
> is a storage handler set up for it.
>
> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks guys. tableProperty("location", ...) works.
>>
>> I have trouble making hive query an iceberg table by following
>> https://iceberg.apache.org/hive/.
>>
>> I have done:
>> *  in Hive shell, do `add jar /path/to/iceberg-hive-runtime.jar;`
>> * in hive-site.xml, add hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
>> The same hive-site.xml is used by both hive server and spark.
>>
>>
>> This is my code:
>> val table = "hive_test.mydb.mytable3"
>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>> df.writeTo(table)
>>     .tableProperty("write.format.default", "parquet")
>>     .tableProperty("location", filePath)
>>     .createOrReplace()
>>
>>
>> The iceberg file is created in the specified location. It can be queried
>> in spark sql.
>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>> Found 2 items
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>> /tmp/test3.ice/data
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
>> /tmp/test3.ice/metadata
>>
>> This hive table is created but cannot be queried:
>> hive> select * from mytable3;
>> FAILED: SemanticException Table does not exist at location:
>> hdfs://namenode:8020/tmp/test3.ice
>>
>> I am using spark 3.1.1 and hive 3.1.2. What else am I missing? I am very
>> close to having a happy path for migrating parquet to iceberg. Thanks.
>>
>>
>>
>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> The problem for #3 is how Spark handles the options. The option method
>>> sets write options, not table properties. The write options aren’t passed
>>> when creating the table. Instead, you should use tableProperty("location",
>>> ...).
>>>
>>> Ryan
>>>
>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>>> integration from iceberg. So you shouldn't see it in hive unless you have
>>>> configured that, see https://iceberg.apache.org/hive/.
>>>>
>>>> 3)
>>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>> I would check what properties are set in the table to see why that
>>>> wasn't set. But "location" would be the correct way of setting the table.
>>>> Unless the property is being ignored by Spark, I'm assuming you are using
>>>> the latest build possible of Spark. There is a bug in 3.0 of Spark which
>>>> ignores options passed to the V2 api sometimes,
>>>> https://issues.apache.org/jira/browse/SPARK-32592 . Which is fixed in
>>>> 3.1
>>>>
>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com> wrote:
>>>>
>>>> Any help is highly appreciated!
>>>>
>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Russell.
>>>>>
>>>>> I tried:
>>>>>
>>>>> /spark/bin/spark-shell --packages
>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>> val values = List(1,2,3,4,5)
>>>>>
>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>> val df = values.toDF()
>>>>>
>>>>> val table = "hive_test.mydb.mytable3"
>>>>> df.writeTo(table)
>>>>>     .tableProperty("write.format.default", "parquet")
>>>>> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>>>>>     .createOrReplace()
>>>>>
>>>>> spark.table(table).show()
>>>>>
>>>>> *Observations*:
>>>>> 1. spark.table(table).show() does show the table correctly.
>>>>> +-----+
>>>>> |value|
>>>>> +-----+
>>>>> |    1|
>>>>> |    2|
>>>>> |    3|
>>>>> |    4|
>>>>> |    5|
>>>>> +-----+
>>>>>
>>>>> 2. mydb.mytable3 is created in HIVE but it is empty:
>>>>> hive> select * from mytable3;
>>>>> OK
>>>>> Time taken: 0.158 seconds
>>>>>
>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>
>>>>> Any idea about 2 and 3? Thanks very much.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>>>>> russell.spitzer@gmail.com> wrote:
>>>>>
>>>>>> Specify a property of "location" when creating the table. Just add a
>>>>>> ".option("location", "path")"
>>>>>>
>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks Russell. This helps a lot.
>>>>>>
>>>>>> I want to specify a HDFS location when creating an iceberg dataset
>>>>>> using dataframe api. All examples using warehouse location are SQL. Do you
>>>>>> have an example for dataframe API? For example, how to support HDFS/S3
>>>>>> location in the query below? The reason I ask is that my current code all
>>>>>> uses spark API. It will be much easier if I can use spark API when
>>>>>> migrating parquet to iceberg. Hope it makes sense.
>>>>>>
>>>>>> data.writeTo("prod.db.table")
>>>>>>     .tableProperty("write.format.default", "orc")
>>>>>>     .partitionBy($"level", days($"ts"))
>>>>>>     .createOrReplace()
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>
>>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Ryan.
>>>>>>>>
>>>>>>>> Using this command (uri is omitted because the uri is in
>>>>>>>> hive-site.xml):
>>>>>>>> spark-shell --conf
>>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>>>>
>>>>>>>> This statement:
>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING
>>>>>>>> iceberg")
>>>>>>>>
>>>>>>>> caused warning:
>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe
>>>>>>>> for data source provider iceberg.
>>>>>>>>
>>>>>>>> I tried:
>>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>>>>> https://github.com/apache/iceberg/issues/2260
>>>>>>>> * use --packages
>>>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>
>>>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>>>> this table. Any ideas are appreciated!
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> I think we should improve the docs for catalogs since it isn’t
>>>>>>>>> clear. We have a few configuration pages that are helpful, but it looks
>>>>>>>>> like they assume you know what your options are already. Take a look at the
>>>>>>>>> Spark docs for catalogs, which is the closest we have right now:
>>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>
>>>>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>>>>> example:
>>>>>>>>>
>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>>
>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which
>>>>>>>>> is why it can’t be used by SQL. This is something that should probably be
>>>>>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>>>>>> the same underlying Spark plan.
>>>>>>>>>
>>>>>>>>> To your question about dataframes vs SQL, I highly recommend SQL
>>>>>>>>> over DataFrames so that you don’t end up needing to use Jars produced by
>>>>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>>>>> should support both because DataFrames are useful for customization in some
>>>>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>>>>> Iceberg is concerned.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>
>>>>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add
>>>>>>>>>> an external table in hive metastore for this parquet. In the future, when
>>>>>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>>>
>>>>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian,
>>>>>>>>>>>
>>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>>>
>>>>>>>>>>> The only restriction for S3 is that you need a metastore to
>>>>>>>>>>> track the table metadata location because S3 doesn't have a way to
>>>>>>>>>>> implement a metadata commit. For a metastore, there are implementations
>>>>>>>>>>> backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming
>>>>>>>>>>> release adds support for DynamoDB without Glue and JDBC.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/.
>>>>>>>>>>>> It should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <
>>>>>>>>>>>> jiangok2006@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am reading
>>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Create your own email signature
>>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Create your own email signature
>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Create your own email signature
>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Create your own email signature
>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>> --
>>
>> Create your own email signature
>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>
>
>
> --
> Ryan Blue
> Tabular
>


-- 

Create your own email signature
<https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>

Re: Writing iceberg table to S3

Posted by Ryan Blue <bl...@tabular.io>.
Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is
a storage handler set up for it.

On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <ji...@gmail.com> wrote:

> Thanks guys. tableProperty("location", ...) works.
>
> I have trouble making hive query an iceberg table by following
> https://iceberg.apache.org/hive/.
>
> I have done:
> *  in Hive shell, do `add jar /path/to/iceberg-hive-runtime.jar;`
> * in hive-site.xml, add hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
> The same hive-site.xml is used by both hive server and spark.
>
>
> This is my code:
> val table = "hive_test.mydb.mytable3"
> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
> df.writeTo(table)
>     .tableProperty("write.format.default", "parquet")
>     .tableProperty("location", filePath)
>     .createOrReplace()
>
>
> The iceberg file is created in the specified location. It can be queried
> in spark sql.
> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
> Found 2 items
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
> /tmp/test3.ice/data
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
> /tmp/test3.ice/metadata
>
> This hive table is created but cannot be queried:
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location:
> hdfs://namenode:8020/tmp/test3.ice
>
> I am using spark 3.1.1 and hive 3.1.2. What else am I missing? I am very
> close to having a happy path for migrating parquet to iceberg. Thanks.
>
>
>
> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> The problem for #3 is how Spark handles the options. The option method
>> sets write options, not table properties. The write options aren’t passed
>> when creating the table. Instead, you should use tableProperty("location",
>> ...).
>>
>> Ryan
>>
>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>> integration from iceberg. So you shouldn't see it in hive unless you have
>>> configured that, see https://iceberg.apache.org/hive/.
>>>
>>> 3)
>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>> I would check what properties are set in the table to see why that
>>> wasn't set. But "location" would be the correct way of setting the table.
>>> Unless the property is being ignored by Spark, I'm assuming you are using
>>> the latest build possible of Spark. There is a bug in 3.0 of Spark which
>>> ignores options passed to the V2 api sometimes,
>>> https://issues.apache.org/jira/browse/SPARK-32592 . Which is fixed in
>>> 3.1
>>>
>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com> wrote:
>>>
>>> Any help is highly appreciated!
>>>
>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Russell.
>>>>
>>>> I tried:
>>>>
>>>> /spark/bin/spark-shell --packages
>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>> val values = List(1,2,3,4,5)
>>>>
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val table = "hive_test.mydb.mytable3"
>>>> df.writeTo(table)
>>>>     .tableProperty("write.format.default", "parquet")
>>>> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>>>>     .createOrReplace()
>>>>
>>>> spark.table(table).show()
>>>>
>>>> *Observations*:
>>>> 1. spark.table(table).show() does show the table correctly.
>>>> +-----+
>>>> |value|
>>>> +-----+
>>>> |    1|
>>>> |    2|
>>>> |    3|
>>>> |    4|
>>>> |    5|
>>>> +-----+
>>>>
>>>> 2. mydb.mytable3 is created in HIVE but it is empty:
>>>> hive> select * from mytable3;
>>>> OK
>>>> Time taken: 0.158 seconds
>>>>
>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>
>>>> Any idea about 2 and 3? Thanks very much.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>>>> russell.spitzer@gmail.com> wrote:
>>>>
>>>>> Specify a property of "location" when creating the table. Just add a
>>>>> ".option("location", "path")"
>>>>>
>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thanks Russell. This helps a lot.
>>>>>
>>>>> I want to specify a HDFS location when creating an iceberg dataset
>>>>> using dataframe api. All examples using warehouse location are SQL. Do you
>>>>> have an example for dataframe API? For example, how to support HDFS/S3
>>>>> location in the query below? The reason I ask is that my current code all
>>>>> uses spark API. It will be much easier if I can use spark API when
>>>>> migrating parquet to iceberg. Hope it makes sense.
>>>>>
>>>>> data.writeTo("prod.db.table")
>>>>>     .tableProperty("write.format.default", "orc")
>>>>>     .partitionBy($"level", days($"ts"))
>>>>>     .createOrReplace()
>>>>>
>>>>>
>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>>>> russell.spitzer@gmail.com> wrote:
>>>>>
>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Ryan.
>>>>>>>
>>>>>>> Using this command (uri is omitted because the uri is in
>>>>>>> hive-site.xml):
>>>>>>> spark-shell --conf
>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>>>
>>>>>>> This statement:
>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>
>>>>>>> caused warning:
>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>>> data source provider iceberg.
>>>>>>>
>>>>>>> I tried:
>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>>>> https://github.com/apache/iceberg/issues/2260
>>>>>>> * use --packages
>>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>
>>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>>> this table. Any ideas are appreciated!
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Lian,
>>>>>>>>
>>>>>>>> I think we should improve the docs for catalogs since it isn’t
>>>>>>>> clear. We have a few configuration pages that are helpful, but it looks
>>>>>>>> like they assume you know what your options are already. Take a look at the
>>>>>>>> Spark docs for catalogs, which is the closest we have right now:
>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>
>>>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>>>> example:
>>>>>>>>
>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>
>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which
>>>>>>>> is why it can’t be used by SQL. This is something that should probably be
>>>>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>>>>> the same underlying Spark plan.
>>>>>>>>
>>>>>>>> To your question about dataframes vs SQL, I highly recommend SQL
>>>>>>>> over DataFrames so that you don’t end up needing to use Jars produced by
>>>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>>>> should support both because DataFrames are useful for customization in some
>>>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>>>> Iceberg is concerned.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>
>>>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add
>>>>>>>>> an external table in hive metastore for this parquet. In the future, when
>>>>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>>
>>>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>>
>>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>>> the table metadata location because S3 doesn't have a way to implement a
>>>>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>>>>> support for DynamoDB without Glue and JDBC.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>>> should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am reading
>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Create your own email signature
>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Create your own email signature
>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Create your own email signature
>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>
>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>


-- 
Ryan Blue
Tabular

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Thanks guys. tableProperty("location", ...) works.

I'm having trouble getting Hive to query an Iceberg table by following
https://iceberg.apache.org/hive/.

I have done:
*  in Hive shell, do `add jar /path/to/iceberg-hive-runtime.jar;`
* in hive-site.xml, add hive.vectorized.execution.enabled=false and
iceberg.engine.hive.enabled=true.
The same hive-site.xml is used by both hive server and spark.
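
For reference, that corresponds roughly to these hive-site.xml entries (a
sketch of the settings above, not copied verbatim from my config):

<property>
  <name>hive.vectorized.execution.enabled</name>
  <value>false</value>
</property>
<property>
  <name>iceberg.engine.hive.enabled</name>
  <value>true</value>
</property>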


This is my code:
val table = "hive_test.mydb.mytable3"
val filePath = "hdfs://namenode:8020/tmp/test3.ice"
df.writeTo(table)
    .tableProperty("write.format.default", "parquet")
    .tableProperty("location", filePath)
    .createOrReplace()


The Iceberg table is created at the specified location, and it can be queried
in Spark SQL (a quick check is shown after the listing below).
root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
Found 2 items
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
/tmp/test3.ice/data
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02
/tmp/test3.ice/metadata
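
The Spark SQL check was just something like this (same catalog and table as
above):

spark.sql("SELECT * FROM hive_test.mydb.mytable3").show()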

The corresponding Hive table is created but cannot be queried:
hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location:
hdfs://namenode:8020/tmp/test3.ice

I am using spark 3.1.1 and hive 3.1.2. What else am I missing? I am very
close to having a happy path for migrating parquet to iceberg. Thanks.



On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <bl...@tabular.io> wrote:

> The problem for #3 is how Spark handles the options. The option method
> sets write options, not table properties. The write options aren’t passed
> when creating the table. Instead, you should use tableProperty("location",
> ...).
>
> Ryan
>
> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <ru...@gmail.com>
> wrote:
>
>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>> integration from iceberg. So you shouldn't see it in hive unless you have
>> configured that, see https://iceberg.apache.org/hive/.
>>
>> 3)
>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>> I would check what properties are set in the table to see why that wasn't
>> set. But "location" would be the correct way of setting the table. Unless
>> the property is being ignored by Spark, I'm assuming you are using the
>> latest build possible of Spark. There is a bug in 3.0 of Spark which
>> ignores options passed to the V2 api sometimes,
>> https://issues.apache.org/jira/browse/SPARK-32592 . Which is fixed in 3.1
>>
>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com> wrote:
>>
>> Any help is highly appreciated!
>>
>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com>
>> wrote:
>>
>>> Thanks Russell.
>>>
>>> I tried:
>>>
>>> /spark/bin/spark-shell --packages
>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>> --conf spark.sql.catalog.hive_test.type=hive
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1,2,3,4,5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val table = "hive_test.mydb.mytable3"
>>> df.writeTo(table)
>>>     .tableProperty("write.format.default", "parquet")
>>> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>>>     .createOrReplace()
>>>
>>> spark.table(table).show()
>>>
>>> *Observations*:
>>> 1. spark.table(table).show() does show the table correctly.
>>> +-----+
>>> |value|
>>> +-----+
>>> |    1|
>>> |    2|
>>> |    3|
>>> |    4|
>>> |    5|
>>> +-----+
>>>
>>> 2. mydb.mytable3 is created in HIVE but it is empty:
>>> hive> select * from mytable3;
>>> OK
>>> Time taken: 0.158 seconds
>>>
>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>
>>> Any idea about 2 and 3? Thanks very much.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> Specify a property of "location" when creating the table. Just add a
>>>> ".option("location", "path")"
>>>>
>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com> wrote:
>>>>
>>>> Thanks Russell. This helps a lot.
>>>>
>>>> I want to specify a HDFS location when creating an iceberg dataset
>>>> using dataframe api. All examples using warehouse location are SQL. Do you
>>>> have an example for dataframe API? For example, how to support HDFS/S3
>>>> location in the query below? The reason I ask is that my current code all
>>>> uses spark API. It will be much easier if I can use spark API when
>>>> migrating parquet to iceberg. Hope it makes sense.
>>>>
>>>> data.writeTo("prod.db.table")
>>>>     .tableProperty("write.format.default", "orc")
>>>>     .partitionBy($"level", days($"ts"))
>>>>     .createOrReplace()
>>>>
>>>>
>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>>> russell.spitzer@gmail.com> wrote:
>>>>
>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>>
>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Ryan.
>>>>>>
>>>>>> Using this command (uri is omitted because the uri is in
>>>>>> hive-site.xml):
>>>>>> spark-shell --conf
>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>>
>>>>>> This statement:
>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>
>>>>>> caused warning:
>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>> data source provider iceberg.
>>>>>>
>>>>>> I tried:
>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>>> https://github.com/apache/iceberg/issues/2260
>>>>>> * use --packages
>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>
>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>> this table. Any ideas are appreciated!
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> I think we should improve the docs for catalogs since it isn’t
>>>>>>> clear. We have a few configuration pages that are helpful, but it looks
>>>>>>> like they assume you know what your options are already. Take a look at the
>>>>>>> Spark docs for catalogs, which is the closest we have right now:
>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>
>>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>>> example:
>>>>>>>
>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>
>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>>> why it can’t be used by SQL. This is something that should probably be
>>>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>>>> the same underlying Spark plan.
>>>>>>>
>>>>>>> To your question about dataframes vs SQL, I highly recommend SQL
>>>>>>> over DataFrames so that you don’t end up needing to use Jars produced by
>>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>>> should support both because DataFrames are useful for customization in some
>>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>>> Iceberg is concerned.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>
>>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>>>>>>> external table in hive metastore for this parquet. In the future, when
>>>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>
>>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>
>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>> the table metadata location because S3 doesn't have a way to implement a
>>>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>>>> support for DynamoDB without Glue and JDBC.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>
>>>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>> should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am reading
>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Create your own email signature
>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Create your own email signature
>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Create your own email signature
>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>
>>
>> --
>>
>> Create your own email signature
>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


-- 

Create your own email signature
<https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>

Re: Writing iceberg table to S3

Posted by Ryan Blue <bl...@tabular.io>.
The problem for #3 is how Spark handles the options. The option method sets
write options, not table properties. The write options aren’t passed when
creating the table. Instead, you should use tableProperty("location", ...).
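
For example, a minimal sketch of the difference, reusing the df and the
hive_test catalog from your earlier message (untested):

// Write options are per-write settings and are not persisted when the table is created:
df.writeTo("hive_test.mydb.mytable3")
  .option("location", "hdfs://namenode:8020/tmp/test.ice")   // dropped at CREATE time
  .createOrReplace()

// Table properties are stored with the table, so the custom location sticks:
df.writeTo("hive_test.mydb.mytable3")
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", "hdfs://namenode:8020/tmp/test.ice")
  .createOrReplace()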

Ryan

On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <ru...@gmail.com>
wrote:

> 2) Hive cannot read Iceberg tables without configuring the MR Hive
> integration from iceberg. So you shouldn't see it in hive unless you have
> configured that, see https://iceberg.apache.org/hive/.
>
> 3)
> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
> I would check what properties are set in the table to see why that wasn't
> set. But "location" would be the correct way of setting the table. Unless
> the property is being ignored by Spark, I'm assuming you are using the
> latest build possible of Spark. There is a bug in 3.0 of Spark which
> ignores options passed to the V2 api sometimes,
> https://issues.apache.org/jira/browse/SPARK-32592 . Which is fixed in 3.1
>
> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com> wrote:
>
> Any help is highly appreciated!
>
> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks Russell.
>>
>> I tried:
>>
>> /spark/bin/spark-shell --packages
>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>> --conf spark.sql.catalog.hive_test.type=hive
>>
>> import org.apache.spark.sql.SparkSession
>> val values = List(1,2,3,4,5)
>>
>> val spark = SparkSession.builder().master("local").getOrCreate()
>> import spark.implicits._
>> val df = values.toDF()
>>
>> val table = "hive_test.mydb.mytable3"
>> df.writeTo(table)
>>     .tableProperty("write.format.default", "parquet")
>> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>>     .createOrReplace()
>>
>> spark.table(table).show()
>>
>> *Observations*:
>> 1. spark.table(table).show() does show the table correctly.
>> +-----+
>> |value|
>> +-----+
>> |    1|
>> |    2|
>> |    3|
>> |    4|
>> |    5|
>> +-----+
>>
>> 2. mydb.mytable3 is created in HIVE but it is empty:
>> hive> select * from mytable3;
>> OK
>> Time taken: 0.158 seconds
>>
>> 3. test.ice is not generated in the HDFS folder /tmp.
>>
>> Any idea about 2 and 3? Thanks very much.
>>
>>
>>
>>
>>
>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>>> Specify a property of "location" when creating the table. Just add a
>>> ".option("location", "path")"
>>>
>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com> wrote:
>>>
>>> Thanks Russell. This helps a lot.
>>>
>>> I want to specify a HDFS location when creating an iceberg dataset using
>>> dataframe api. All examples using warehouse location are SQL. Do you have
>>> an example for dataframe API? For example, how to support HDFS/S3 location
>>> in the query below? The reason I ask is that my current code all uses spark
>>> API. It will be much easier if I can use spark API when migrating parquet
>>> to iceberg. Hope it makes sense.
>>>
>>> data.writeTo("prod.db.table")
>>>     .tableProperty("write.format.default", "orc")
>>>     .partitionBy($"level", days($"ts"))
>>>     .createOrReplace()
>>>
>>>
>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> The config you used specified a catalog named "hive_prod", so to
>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>>
>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Ryan.
>>>>>
>>>>> Using this command (uri is omitted because the uri is in
>>>>> hive-site.xml):
>>>>> spark-shell --conf
>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>>> spark.sql.catalog.hive_prod.type=hive
>>>>>
>>>>> This statement:
>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>
>>>>> caused warning:
>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>> data source provider iceberg.
>>>>>
>>>>> I tried:
>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>>> https://github.com/apache/iceberg/issues/2260
>>>>> * use --packages
>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>
>>>>> but they did not help. This warning blocks inserting any data into
>>>>> this table. Any ideas are appreciated!
>>>>>
>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Lian,
>>>>>>
>>>>>> I think we should improve the docs for catalogs since it isn’t clear.
>>>>>> We have a few configuration pages that are helpful, but it looks like they
>>>>>> assume you know what your options are already. Take a look at the Spark
>>>>>> docs for catalogs, which is the closest we have right now:
>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>
>>>>>> What you’ll want to do is to configure a catalog like the first
>>>>>> example:
>>>>>>
>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>
>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>> why it can’t be used by SQL. This is something that should probably be
>>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>>> the same underlying Spark plan.
>>>>>>
>>>>>> To your question about dataframes vs SQL, I highly recommend SQL over
>>>>>> DataFrames so that you don’t end up needing to use Jars produced by
>>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>>> should support both because DataFrames are useful for customization in some
>>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>>> Iceberg is concerned.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Eduard and Ryan.
>>>>>>>
>>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>>>>>> external table in hive metastore for this parquet. In the future, when
>>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>
>>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>>> friendly and scalable. Thanks very much!
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Lian,
>>>>>>>>
>>>>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>>>>> the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>
>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>> the table metadata location because S3 doesn't have a way to implement a
>>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>>> support for DynamoDB without Glue and JDBC.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>
>>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>> should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am reading
>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>>>>>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>>>>>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>>>>>>>> can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Create your own email signature
>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>>
>>>
>>
>> --
>>
>> Create your own email signature
>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>
>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>
>
>

-- 
Ryan Blue
Tabular

Re: Writing iceberg table to S3

Posted by Russell Spitzer <ru...@gmail.com>.
2) Hive cannot read Iceberg tables without configuring Iceberg's MR Hive integration, so you shouldn't expect to query the table from Hive unless you have configured that; see https://iceberg.apache.org/hive/.

3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
I would check what properties are set on the table to see why that wasn't set. "location" would be the correct way of setting the table location, unless the property is being ignored by Spark. I'm assuming you are using the latest possible build of Spark: there is a bug in Spark 3.0 which sometimes ignores options passed to the V2 API, https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.
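
For what it's worth, a rough sketch of those two checks, reusing names from earlier in the thread (treat this as an untested outline rather than the exact commands):

For #2, the Hive side needs roughly:

hive> ADD JAR /path/to/iceberg-hive-runtime.jar;
hive> SET iceberg.mr.catalog=hive;   -- load Iceberg tables through the Hive metastore catalog
hive> SELECT * FROM mydb.mytable3;

For #3, the table properties can be inspected from Spark SQL with:

spark.sql("SHOW TBLPROPERTIES hive_test.mydb.mytable3").show(false)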

> On Aug 11, 2021, at 11:00 AM, Lian Jiang <ji...@gmail.com> wrote:
> 
> Any help is highly appreciated!
> 
> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
> Thanks Russell.
> 
> I tried:
> 
> /spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_test.type=hive
> 
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
> 
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
> 
> val table = "hive_test.mydb.mytable3"
> df.writeTo(table)
>     .tableProperty("write.format.default", "parquet")
>     .option("location", "hdfs://namenode:8020/tmp/test.ice")
>     .createOrReplace()
> 
> spark.table(table).show()
> 
> Observations:
> 1. spark.table(table).show() does show the table correctly.
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    5|
> +-----+
> 
> 2. mydb.mytable3 is created in HIVE but it is empty:
> hive> select * from mytable3;
> OK
> Time taken: 0.158 seconds
> 
> 3. test.ice is not generated in the HDFS folder /tmp.
> 
> Any idea about 2 and 3? Thanks very much.
> 
> 
> 
> 
> 
> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
> Specify a property of "location" when creating the table. Just add a ".option("location", "path")"
> 
>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thanks Russell. This helps a lot.
>> 
>> I want to specify a HDFS location when creating an iceberg dataset using dataframe api. All examples using warehouse location are SQL. Do you have an example for dataframe API? For example, how to support HDFS/S3 location in the query below? The reason I ask is that my current code all uses spark API. It will be much easier if I can use spark API when migrating parquet to iceberg. Hope it makes sense.
>> 
>> data.writeTo("prod.db.table")
>>     .tableProperty("write.format.default", "orc")
>>     .partitionBy($"level", days($"ts"))
>>     .createOrReplace()
>> 
>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>> 
>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
>> Thanks Ryan.
>> 
>> Using this command (uri is omitted because the uri is in hive-site.xml):
>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
>> 
>> This statement:
>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>> 
>> caused warning:
>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>> 
>> I tried:
>> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar to spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260 <https://github.com/apache/iceberg/issues/2260>
>> * use --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>> 
>> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
>> 
>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
>> Lian,
>> 
>> I think we should improve the docs for catalogs since it isn’t clear. We have a few configuration pages that are helpful, but it looks like they assume you know what your options are already. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration <https://iceberg.apache.org/spark-configuration/#catalog-configuration>
>> What you’ll want to do is to configure a catalog like the first example:
>> 
>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>> spark.sql.catalog.hive_prod.type = hive
>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port <>
>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can’t be used by SQL. This is something that should probably be added to Spark and not Iceberg since it is just a different way to build the same underlying Spark plan.
>> 
>> To your question about dataframes vs SQL, I highly recommend SQL over DataFrames so that you don’t end up needing to use Jars produced by compiling Scala code. I think it’s easier to just use SQL. But Iceberg should support both because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>> 
>> Ryan
>> 
>> 
>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
>> Thanks Eduard and Ryan.
>> 
>> I use spark on a K8S cluster to write parquet on s3 and then add an external table in hive metastore for this parquet. In the future, when using iceberg, I prefer hive metadata store since it is my centralized metastore for batch and streaming datasets. I don't see that hive metastore is supported in iceberg AWS integration on https://iceberg.apache.org/aws/ <https://iceberg.apache.org/aws/>. Is there another link for that?
>> 
>> Most of the examples use spark sql to write/read iceberg. For example, there is no "sql merge into" like support for spark API. Is spark sql preferred over spark dataframe/dataset API in Iceberg? If so, could you clarify the rationale behind? I personally feel spark API is more dev friendly and scalable. Thanks very much!  
>> 
>> 
>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
>> Lian,
>> 
>> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
>> 
>> The only restriction for S3 is that you need a metastore to track the table metadata location because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue and JDBC.
>> 
>> Ryan
>> 
>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <eduard@dremio.com <ma...@dremio.com>> wrote:
>> Lian you can have a look at https://iceberg.apache.org/aws/ <https://iceberg.apache.org/aws/>. It should contain all the info that you need. The codebase contains a S3FileIO class, which is an implementation that is backed by S3.
>> 
>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes <https://iceberg.apache.org/spark-writes/#spark-writes> and wondering if it is possible to create an iceberg table on S3. This guide seems to say only write to a hive table (backed up by HDFS if I understand correctly). Hudi and Delta can write to s3 with a specified S3 path. How can I do it using iceberg? Thanks for any clarification.
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Tabular
>> 
>> 
>> -- 
>> Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>  
>> 
>> 
>> -- 
>> Ryan Blue
>> Tabular
>> 
>> 
>> -- 
>> Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>  
>> 
>> 
>> -- 
>> Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>  
> 
> 
> 
> -- 
> Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>  
> 
> 
> -- 
> Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>  


Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Any help is highly appreciated!

On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <ji...@gmail.com> wrote:

> Thanks Russell.
>
> I tried:
>
> /spark/bin/spark-shell --packages
> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
> --conf spark.sql.catalog.hive_test.type=hive
>
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val table = "hive_test.mydb.mytable3"
> df.writeTo(table)
>     .tableProperty("write.format.default", "parquet")
> *    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
>     .createOrReplace()
>
> spark.table(table).show()
>
> *Observations*:
> 1. spark.table(table).show() does show the table correctly.
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    5|
> +-----+
>
> 2. mydb.mytable3 is created in HIVE but it is empty:
> hive> select * from mytable3;
> OK
> Time taken: 0.158 seconds
>
> 3. test.ice is not generated in the HDFS folder /tmp.
>
> Any idea about 2 and 3? Thanks very much.
>
>
>
>
>
> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <ru...@gmail.com>
> wrote:
>
>> Specify a property of "location" when creating the table. Just add a
>> ".option("location", "path")"
>>
>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com> wrote:
>>
>> Thanks Russell. This helps a lot.
>>
>> I want to specify a HDFS location when creating an iceberg dataset using
>> dataframe api. All examples using warehouse location are SQL. Do you have
>> an example for dataframe API? For example, how to support HDFS/S3 location
>> in the query below? The reason I ask is that my current code all uses spark
>> API. It will be much easier if I can use spark API when migrating parquet
>> to iceberg. Hope it makes sense.
>>
>> data.writeTo("prod.db.table")
>>     .tableProperty("write.format.default", "orc")
>>     .partitionBy($"level", days($"ts"))
>>     .createOrReplace()
>>
>>
>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <ru...@gmail.com>
>> wrote:
>>
>>> The config you used specified a catalog named "hive_prod", so to
>>> reference it you need to either "use hive_prod" or refer to the table with
>>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>>
>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>> spark-shell --conf
>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>>> spark.sql.catalog.hive_prod.type=hive
>>>>
>>>> This statement:
>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>
>>>> caused warning:
>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>> data source provider iceberg.
>>>>
>>>> I tried:
>>>> * the solution (put iceberg-hive-runtime.jar and
>>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>>> https://github.com/apache/iceberg/issues/2260
>>>> * use --packages
>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>
>>>> but they did not help. This warning blocks inserting any data into this
>>>> table. Any ideas are appreciated!
>>>>
>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Lian,
>>>>>
>>>>> I think we should improve the docs for catalogs since it isn’t clear.
>>>>> We have a few configuration pages that are helpful, but it looks like they
>>>>> assume you know what your options are already. Take a look at the Spark
>>>>> docs for catalogs, which is the closest we have right now:
>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>
>>>>> What you’ll want to do is to configure a catalog like the first
>>>>> example:
>>>>>
>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>
>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>> why it can’t be used by SQL. This is something that should probably be
>>>>> added to Spark and not Iceberg since it is just a different way to build
>>>>> the same underlying Spark plan.
>>>>>
>>>>> To your question about dataframes vs SQL, I highly recommend SQL over
>>>>> DataFrames so that you don’t end up needing to use Jars produced by
>>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>>> should support both because DataFrames are useful for customization in some
>>>>> cases. It really should be up to you and what you want to use, as far as
>>>>> Iceberg is concerned.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Eduard and Ryan.
>>>>>>
>>>>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>>>>> external table in hive metastore for this parquet. In the future, when
>>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>> hive metastore is supported in iceberg AWS integration on
>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>
>>>>>> Most of the examples use spark sql to write/read iceberg. For
>>>>>> example, there is no "sql merge into" like support for spark API. Is spark
>>>>>> sql preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>>> friendly and scalable. Thanks very much!
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>>>> the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>
>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>> the table metadata location because S3 doesn't have a way to implement a
>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>> support for DynamoDB without Glue and JDBC.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <
>>>>>>> eduard@dremio.com> wrote:
>>>>>>>
>>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>> should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>>> and wondering if it is possible to create an iceberg table on S3. This
>>>>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I
>>>>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified S3
>>>>>>>>> path. How can I do it using iceberg? Thanks for any clarification.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Create your own email signature
>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Create your own email signature
>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>
>>>
>>
>> --
>>
>> Create your own email signature
>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>
>>
>>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>


-- 

Create your own email signature
<https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Thanks Russell.

I tried:

/spark/bin/spark-shell --packages
org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
--conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.hive_test.type=hive

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

val table = "hive_test.mydb.mytable3"
df.writeTo(table)
    .tableProperty("write.format.default", "parquet")
*    .option("location", "hdfs://namenode:8020/tmp/test.ice")*
    .createOrReplace()

spark.table(table).show()

*Observations*:
1. spark.table(table).show() does show the table correctly.
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+

2. mydb.mytable3 is created in HIVE but it is empty:
hive> select * from mytable3;
OK
Time taken: 0.158 seconds

3. test.ice is not generated in the HDFS folder /tmp.

Any idea about 2 and 3? Thanks very much.





On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <ru...@gmail.com>
wrote:

> Specify a property of "location" when creating the table. Just add a
> ".option("location", "path")"
>
> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com> wrote:
>
> Thanks Russell. This helps a lot.
>
> I want to specify a HDFS location when creating an iceberg dataset using
> dataframe api. All examples using warehouse location are SQL. Do you have
> an example for dataframe API? For example, how to support HDFS/S3 location
> in the query below? The reason I ask is that my current code all uses spark
> API. It will be much easier if I can use spark API when migrating parquet
> to iceberg. Hope it makes sense.
>
> data.writeTo("prod.db.table")
>     .tableProperty("write.format.default", "orc")
>     .partitionBy($"level", days($"ts"))
>     .createOrReplace()
>
>
> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <ru...@gmail.com>
> wrote:
>
>> The config you used specified a catalog named "hive_prod", so to
>> reference it you need to either "use hive_prod" or refer to the table with
>> the catalog identifier "CREATE TABLE hive_prod.default.mytable"
>>
>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Thanks Ryan.
>>>
>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>> spark-shell --conf
>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>>> spark.sql.catalog.hive_prod.type=hive
>>>
>>> This statement:
>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>
>>> caused warning:
>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>> data source provider iceberg.
>>>
>>> I tried:
>>> * the solution (put iceberg-hive-runtime.jar and
>>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>>> https://github.com/apache/iceberg/issues/2260
>>> * use --packages
>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>
>>> but they did not help. This warning blocks inserting any data into this
>>> table. Any ideas are appreciated!
>>>
>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Lian,
>>>>
>>>> I think we should improve the docs for catalogs since it isn’t clear.
>>>> We have a few configuration pages that are helpful, but it looks like they
>>>> assume you know what your options are already. Take a look at the Spark
>>>> docs for catalogs, which is the closest we have right now:
>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>
>>>> What you’ll want to do is to configure a catalog like the first example:
>>>>
>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>> spark.sql.catalog.hive_prod.type = hive
>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>
>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>> why it can’t be used by SQL. This is something that should probably be
>>>> added to Spark and not Iceberg since it is just a different way to build
>>>> the same underlying Spark plan.
>>>>
>>>> To your question about dataframes vs SQL, I highly recommend SQL over
>>>> DataFrames so that you don’t end up needing to use Jars produced by
>>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>>> should support both because DataFrames are useful for customization in some
>>>> cases. It really should be up to you and what you want to use, as far as
>>>> Iceberg is concerned.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Eduard and Ryan.
>>>>>
>>>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>>>> external table in hive metastore for this parquet. In the future, when
>>>>> using iceberg, I prefer hive metadata store since it is my
>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>> hive metastore is supported in iceberg AWS integration on
>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>
>>>>> Most of the examples use spark sql to write/read iceberg. For example,
>>>>> there is no "sql merge into" like support for spark API. Is spark sql
>>>>> preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>>> friendly and scalable. Thanks very much!
>>>>>
>>>>>
>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Lian,
>>>>>>
>>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>>> the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>> location to S3 so tables are automatically created there.
>>>>>>
>>>>>> The only restriction for S3 is that you need a metastore to track the
>>>>>> table metadata location because S3 doesn't have a way to implement a
>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>> support for DynamoDB without Glue and JDBC.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>> should contain all the info that you need. The codebase contains a *S3FileIO
>>>>>>> *class, which is an implementation that is backed by S3.
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>> and wondering if it is possible to create an iceberg table on S3. This
>>>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I
>>>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified S3
>>>>>>>> path. How can I do it using iceberg? Thanks for any clarification.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>
>
>

-- 

Create your own email signature
<https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>

Re: Writing iceberg table to S3

Posted by Russell Spitzer <ru...@gmail.com>.
Specify a property of "location" when creating the table. Just add a ".option("location", "path")"
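
In the writeTo chain from the quoted message below, that suggestion would slot in roughly like this (untested sketch; note that elsewhere in the thread tableProperty("location", ...) ends up being the setting that actually persists at table creation):

data.writeTo("prod.db.table")
    .tableProperty("write.format.default", "orc")
    .option("location", "hdfs://namenode:8020/path/to/table")   // hypothetical path
    .createOrReplace()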

> On Aug 10, 2021, at 11:15 AM, Lian Jiang <ji...@gmail.com> wrote:
> 
> Thanks Russell. This helps a lot.
> 
> I want to specify a HDFS location when creating an iceberg dataset using dataframe api. All examples using warehouse location are SQL. Do you have an example for dataframe API? For example, how to support HDFS/S3 location in the query below? The reason I ask is that my current code all uses spark API. It will be much easier if I can use spark API when migrating parquet to iceberg. Hope it makes sense.
> 
> data.writeTo("prod.db.table")
>     .tableProperty("write.format.default", "orc")
>     .partitionBy($"level", days($"ts"))
>     .createOrReplace()
> 
> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier "CREATE TABLE hive_prod.default.mytable"
> 
> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
> Thanks Ryan.
> 
> Using this command (uri is omitted because the uri is in hive-site.xml):
> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
> 
> This statement:
> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
> 
> caused warning:
> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
> 
> I tried:
> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar to spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260 <https://github.com/apache/iceberg/issues/2260>
> * use --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
> 
> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
> 
> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
> Lian,
> 
> I think we should improve the docs for catalogs since it isn’t clear. We have a few configuration pages that are helpful, but it looks like they assume you know what your options are already. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration <https://iceberg.apache.org/spark-configuration/#catalog-configuration>
> What you’ll want to do is to configure a catalog like the first example:
> 
> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
> spark.sql.catalog.hive_prod.type = hive
> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can’t be used by SQL. This is something that should probably be added to Spark and not Iceberg since it is just a different way to build the same underlying Spark plan.
> 
> To your question about dataframes vs SQL, I highly recommend SQL over DataFrames so that you don’t end up needing to use Jars produced by compiling Scala code. I think it’s easier to just use SQL. But Iceberg should support both because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
> 
> Ryan
> 
> 
> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
> Thanks Eduard and Ryan.
> 
> I use spark on a K8S cluster to write parquet on s3 and then add an external table in hive metastore for this parquet. In the future, when using iceberg, I prefer hive metadata store since it is my centralized metastore for batch and streaming datasets. I don't see that hive metastore is supported in iceberg AWS integration on https://iceberg.apache.org/aws/ <https://iceberg.apache.org/aws/>. Is there another link for that?
> 
> Most of the examples use spark sql to write/read iceberg. For example, there is no "sql merge into" like support for spark API. Is spark sql preferred over spark dataframe/dataset API in Iceberg? If so, could you clarify the rationale behind? I personally feel spark API is more dev friendly and scalable. Thanks very much!  
> 
> 
> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
> Lian,
> 
> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
> 
> The only restriction for S3 is that you need a metastore to track the table metadata location because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue and JDBC.
> 
> Ryan
> 
> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <eduard@dremio.com <ma...@dremio.com>> wrote:
> Lian you can have a look at https://iceberg.apache.org/aws/ <https://iceberg.apache.org/aws/>. It should contain all the info that you need. The codebase contains a S3FileIO class, which is an implementation that is backed by S3.
> 
> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2006@gmail.com <ma...@gmail.com>> wrote:
> I am reading https://iceberg.apache.org/spark-writes/#spark-writes <https://iceberg.apache.org/spark-writes/#spark-writes> and wondering if it is possible to create an iceberg table on S3. This guide seems to say only write to a hive table (backed up by HDFS if I understand correctly). Hudi and Delta can write to s3 with a specified S3 path. How can I do it using iceberg? Thanks for any clarification.
> 
> 
> 
> 
> -- 
> Ryan Blue
> Tabular
> 
> 
> 
> 
> -- 
> Ryan Blue
> Tabular
> 
> 
> 
> 


Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Thanks Russell. This helps a lot.

I want to specify an HDFS location when creating an iceberg dataset using
the dataframe API. All of the examples that use a warehouse location are SQL.
Do you have an example for the dataframe API? For example, how do I pass an
HDFS/S3 location in the query below? The reason I ask is that my current code
all uses the spark API, so it will be much easier if I can keep using the
spark API when migrating parquet to iceberg. Hope it makes sense.

data.writeTo("prod.db.table")
    .tableProperty("write.format.default", "orc")
    .partitionBy($"level", days($"ts"))
    .createOrReplace()
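
Something like the sketch below is what I am hoping for (an untested guess on
my side; the "location" property name and the path are placeholders):

data.writeTo("prod.db.table")
    .tableProperty("location", "hdfs://namenode:8020/tmp/table")
    .createOrReplace()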


On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <ru...@gmail.com>
wrote:

> The config you used specified a catalog named "hive_prod", so to reference
> it you need to either "use hive_prod" or refer to the table with the
> catalog identifier "CREATE TABLE hive_prod.default.mytable"
>
> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks Ryan.
>>
>> Using this command (uri is omitted because the uri is in hive-site.xml):
>> spark-shell --conf
>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
>> spark.sql.catalog.hive_prod.type=hive
>>
>> This statement:
>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>
>> caused warning:
>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data
>> source provider iceberg.
>>
>> I tried:
>> * the solution (put iceberg-hive-runtime.jar and
>> iceberg-spark3-runtime.jar to spark/jars) mentioned in
>> https://github.com/apache/iceberg/issues/2260
>> * use --packages
>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>
>> but they did not help. This warning blocks inserting any data into this
>> table. Any ideas are appreciated!
>>
>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Lian,
>>>
>>> I think we should improve the docs for catalogs since it isn’t clear. We
>>> have a few configuration pages that are helpful, but it looks like they
>>> assume you know what your options are already. Take a look at the Spark
>>> docs for catalogs, which is the closest we have right now:
>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>
>>> What you’ll want to do is to configure a catalog like the first example:
>>>
>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>> spark.sql.catalog.hive_prod.type = hive
>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>
>>> For MERGE INTO, an equivalent DataFrame API is not present in Spark, which
>>> is why it can only be used through SQL. A DataFrame version is something
>>> that should probably be added to Spark and not Iceberg, since it is just a
>>> different way to build the same underlying Spark plan.
>>>
>>> To your question about dataframes vs SQL, I highly recommend SQL over
>>> DataFrames so that you don’t end up needing to use Jars produced by
>>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>>> should support both because DataFrames are useful for customization in some
>>> cases. It really should be up to you and what you want to use, as far as
>>> Iceberg is concerned.
>>>
>>> Ryan
>>>
>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com> wrote:
>>>
>>>> Thanks Eduard and Ryan.
>>>>
>>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>>> external table in hive metastore for this parquet. In the future, when
>>>> using iceberg, I prefer hive metadata store since it is my
>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>> hive metastore is supported in iceberg AWS integration on
>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>
>>>> Most of the examples use spark sql to write/read iceberg. For example,
>>>> there is no "sql merge into" like support for spark API. Is spark sql
>>>> preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>>> clarify the rationale behind? I personally feel spark API is more dev
>>>> friendly and scalable. Thanks very much!
>>>>
>>>>
>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Lian,
>>>>>
>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>> the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>> location to S3 so tables are automatically created there.
>>>>>
>>>>> The only restriction for S3 is that you need a metastore to track the
>>>>> table metadata location because S3 doesn't have a way to implement a
>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>> support for DynamoDB without Glue and JDBC.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
>>>>> wrote:
>>>>>
>>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>>> should contain all the info that you need. The codebase contains an
>>>>>> *S3FileIO* class, which is an implementation that is backed by S3.
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>> and wondering if it is possible to create an iceberg table on S3. This
>>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I
>>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified S3
>>>>>>> path. How can I do it using iceberg? Thanks for any clarification.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>>
>


Re: Writing iceberg table to S3

Posted by Russell Spitzer <ru...@gmail.com>.
The config you used specified a catalog named "hive_prod", so to reference
it you need to either "use hive_prod" or refer to the table with the
catalog identifier "CREATE TABLE hive_prod.default.mytable"

On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <ji...@gmail.com> wrote:

> Thanks Ryan.
>
> Using this command (uri is omitted because the uri is in hive-site.xml):
> spark-shell --conf
> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
> spark.sql.catalog.hive_prod.type=hive
>
> This statement:
> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>
> caused warning:
> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data
> source provider iceberg.
>
> I tried:
> * the solution (put iceberg-hive-runtime.jar and
> iceberg-spark3-runtime.jar to spark/jars) mentioned in
> https://github.com/apache/iceberg/issues/2260
> * use --packages
> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>
> but they did not help. This warning blocks inserting any data into this
> table. Any ideas are appreciated!
>
> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Lian,
>>
>> I think we should improve the docs for catalogs since it isn’t clear. We
>> have a few configuration pages that are helpful, but it looks like they
>> assume you know what your options are already. Take a look at the Spark
>> docs for catalogs, which is the closest we have right now:
>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>
>> What you’ll want to do is to configure a catalog like the first example:
>>
>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>> spark.sql.catalog.hive_prod.type = hive
>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>
>> For MERGE INTO, an equivalent DataFrame API is not present in Spark, which
>> is why it can only be used through SQL. A DataFrame version is something
>> that should probably be added to Spark and not Iceberg, since it is just a
>> different way to build the same underlying Spark plan.
>>
>> To your question about dataframes vs SQL, I highly recommend SQL over
>> DataFrames so that you don’t end up needing to use Jars produced by
>> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
>> should support both because DataFrames are useful for customization in some
>> cases. It really should be up to you and what you want to use, as far as
>> Iceberg is concerned.
>>
>> Ryan
>>
>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Thanks Eduard and Ryan.
>>>
>>> I use spark on a K8S cluster to write parquet on s3 and then add an
>>> external table in hive metastore for this parquet. In the future, when
>>> using iceberg, I prefer hive metadata store since it is my
>>> centralized metastore for batch and streaming datasets. I don't see that
>>> hive metastore is supported in iceberg AWS integration on
>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>
>>> Most of the examples use spark sql to write/read iceberg. For example,
>>> there is no "sql merge into" like support for spark API. Is spark sql
>>> preferred over spark dataframe/dataset API in Iceberg? If so, could you
>>> clarify the rationale behind? I personally feel spark API is more dev
>>> friendly and scalable. Thanks very much!
>>>
>>>
>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Lian,
>>>>
>>>> Iceberg tables work great in S3. When creating the table, just pass the
>>>> `LOCATION` clause with an S3 path, or set your catalog's warehouse location
>>>> to S3 so tables are automatically created there.
>>>>
>>>> The only restriction for S3 is that you need a metastore to track the
>>>> table metadata location because S3 doesn't have a way to implement a
>>>> metadata commit. For a metastore, there are implementations backed by the
>>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>> support for DynamoDB without Glue and JDBC.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
>>>> wrote:
>>>>
>>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It
>>>>> should contain all the info that you need. The codebase contains an
>>>>> *S3FileIO* class, which is an implementation that is backed by S3.
>>>>>
>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>> and wondering if it is possible to create an iceberg table on S3. This
>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I
>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified S3
>>>>>> path. How can I do it using iceberg? Thanks for any clarification.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
>

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Thanks Ryan.

Using this command (uri is omitted because the uri is in hive-site.xml):
spark-shell --conf
spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf
spark.sql.catalog.hive_prod.type=hive

This statement:
spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")

caused warning:
WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data
source provider iceberg.

I tried:
* the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar
to spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
* use --packages
org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1

but they did not help. This warning blocks inserting any data into this
table. Any ideas are appreciated!

On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <bl...@tabular.io> wrote:

> Lian,
>
> I think we should improve the docs for catalogs since it isn’t clear. We
> have a few configuration pages that are helpful, but it looks like they
> assume you know what your options are already. Take a look at the Spark
> docs for catalogs, which is the closest we have right now:
> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>
> What you’ll want to do is to configure a catalog like the first example:
>
> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
> spark.sql.catalog.hive_prod.type = hive
> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>
> For MERGE INTO, an equivalent DataFrame API is not present in Spark, which
> is why it can only be used through SQL. A DataFrame version is something
> that should probably be added to Spark and not Iceberg, since it is just a
> different way to build the same underlying Spark plan.
>
> To your question about dataframes vs SQL, I highly recommend SQL over
> DataFrames so that you don’t end up needing to use Jars produced by
> compiling Scala code. I think it’s easier to just use SQL. But Iceberg
> should support both because DataFrames are useful for customization in some
> cases. It really should be up to you and what you want to use, as far as
> Iceberg is concerned.
>
> Ryan
>
> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks Eduard and Ryan.
>>
>> I use spark on a K8S cluster to write parquet on s3 and then add an
>> external table in hive metastore for this parquet. In the future, when
>> using iceberg, I prefer hive metadata store since it is my
>> centralized metastore for batch and streaming datasets. I don't see that
>> hive metastore is supported in iceberg AWS integration on
>> https://iceberg.apache.org/aws/. Is there another link for that?
>>
>> Most of the examples use spark sql to write/read iceberg. For example,
>> there is no "sql merge into" like support for spark API. Is spark sql
>> preferred over spark dataframe/dataset API in Iceberg? If so, could you
>> clarify the rationale behind? I personally feel spark API is more dev
>> friendly and scalable. Thanks very much!
>>
>>
>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Lian,
>>>
>>> Iceberg tables work great in S3. When creating the table, just pass the
>>> `LOCATION` clause with an S3 path, or set your catalog's warehouse location
>>> to S3 so tables are automatically created there.
>>>
>>> The only restriction for S3 is that you need a metastore to track the
>>> table metadata location because S3 doesn't have a way to implement a
>>> metadata commit. For a metastore, there are implementations backed by the
>>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>>> support for DynamoDB without Glue and JDBC.
>>>
>>> Ryan
>>>
>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
>>> wrote:
>>>
>>>> Lian you can have a look at https://iceberg.apache.org/aws/. It should
>>>> contain all the info that you need. The codebase contains an *S3FileIO* class,
>>>> which is an implementation that is backed by S3.
>>>>
>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>> and wondering if it is possible to create an iceberg table on S3. This
>>>>> guide seems to say only write to a hive table (backed up by HDFS if I
>>>>> understand correctly). Hudi and Delta can write to s3 with a specified S3
>>>>> path. How can I do it using iceberg? Thanks for any clarification.
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>>
>
>
> --
> Ryan Blue
> Tabular
>



Re: Writing iceberg table to S3

Posted by Ryan Blue <bl...@tabular.io>.
Lian,

I think we should improve the docs for catalogs since it isn’t clear. We
have a few configuration pages that are helpful, but it looks like they
assume you know what your options are already. Take a look at the Spark
docs for catalogs, which is the closest we have right now:
https://iceberg.apache.org/spark-configuration/#catalog-configuration

What you’ll want to do is to configure a catalog like the first example:

spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
# omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml

For MERGE INTO, an equivalent DataFrame API is not present in Spark, which is
why it can only be used through SQL. A DataFrame version is something that
should probably be added to Spark and not Iceberg, since it is just a
different way to build the same underlying Spark plan.
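
For reference, the SQL form looks roughly like this (a sketch with placeholder
table names; it assumes the Iceberg SQL extensions are enabled in the session):

MERGE INTO hive_prod.db.target t
USING hive_prod.db.updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *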

To your question about dataframes vs SQL, I highly recommend SQL over
DataFrames so that you don’t end up needing to use Jars produced by
compiling Scala code. I think it’s easier to just use SQL. But Iceberg
should support both because DataFrames are useful for customization in some
cases. It really should be up to you and what you want to use, as far as
Iceberg is concerned.

Ryan

On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <ji...@gmail.com> wrote:

> Thanks Eduard and Ryan.
>
> I use spark on a K8S cluster to write parquet on s3 and then add an
> external table in hive metastore for this parquet. In the future, when
> using iceberg, I prefer hive metadata store since it is my
> centralized metastore for batch and streaming datasets. I don't see that
> hive metastore is supported in iceberg AWS integration on
> https://iceberg.apache.org/aws/. Is there another link for that?
>
> Most of the examples use spark sql to write/read iceberg. For example,
> there is no "sql merge into" like support for spark API. Is spark sql
> preferred over spark dataframe/dataset API in Iceberg? If so, could you
> clarify the rationale behind? I personally feel spark API is more dev
> friendly and scalable. Thanks very much!
>
>
> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Lian,
>>
>> Iceberg tables work great in S3. When creating the table, just pass the
>> `LOCATION` clause with an S3 path, or set your catalog's warehouse location
>> to S3 so tables are automatically created there.
>>
>> The only restriction for S3 is that you need a metastore to track the
>> table metadata location because S3 doesn't have a way to implement a
>> metadata commit. For a metastore, there are implementations backed by the
>> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
>> support for DynamoDB without Glue and JDBC.
>>
>> Ryan
>>
>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
>> wrote:
>>
>>> Lian you can have a look at https://iceberg.apache.org/aws/. It should
>>> contain all the info that you need. The codebase contains an *S3FileIO* class,
>>> which is an implementation that is backed by S3.
>>>
>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com> wrote:
>>>
>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and
>>>> wondering if it is possible to create an iceberg table on S3. This guide
>>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>>> can I do it using iceberg? Thanks for any clarification.
>>>>
>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
>


-- 
Ryan Blue
Tabular

Re: Writing iceberg table to S3

Posted by Lian Jiang <ji...@gmail.com>.
Thanks Eduard and Ryan.

I use spark on a K8S cluster to write parquet on s3 and then add an
external table in hive metastore for this parquet. In the future, when
using iceberg, I prefer hive metadata store since it is my
centralized metastore for batch and streaming datasets. I don't see that
hive metastore is supported in iceberg AWS integration on
https://iceberg.apache.org/aws/. Is there another link for that?

Most of the examples use spark sql to write/read iceberg. For example,
there is no "sql merge into" like support for spark API. Is spark sql
preferred over spark dataframe/dataset API in Iceberg? If so, could you
clarify the rationale behind? I personally feel spark API is more dev
friendly and scalable. Thanks very much!


On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <bl...@tabular.io> wrote:

> Lian,
>
> Iceberg tables work great in S3. When creating the table, just pass the
> `LOCATION` clause with an S3 path, or set your catalog's warehouse location
> to S3 so tables are automatically created there.
>
> The only restriction for S3 is that you need a metastore to track the
> table metadata location because S3 doesn't have a way to implement a
> metadata commit. For a metastore, there are implementations backed by the
> Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds
> support for DynamoDB without Glue and JDBC.
>
> Ryan
>
> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
> wrote:
>
>> Lian you can have a look at https://iceberg.apache.org/aws/. It should
>> contain all the info that you need. The codebase contains an *S3FileIO* class,
>> which is an implementation that is backed by S3.
>>
>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and
>>> wondering if it is possible to create an iceberg table on S3. This guide
>>> seems to say only write to a hive table (backed up by HDFS if I understand
>>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>>> can I do it using iceberg? Thanks for any clarification.
>>>
>>>
>>>
>
> --
> Ryan Blue
> Tabular
>



Re: Writing iceberg table to S3

Posted by Ryan Blue <bl...@tabular.io>.
Lian,

Iceberg tables work great in S3. When creating the table, just pass the
`LOCATION` clause with an S3 path, or set your catalog's warehouse location
to S3 so tables are automatically created there.
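
For example, something along these lines (the catalog, namespace, and bucket
names are placeholders):

CREATE TABLE hive_prod.db.sample (
    id bigint,
    data string)
USING iceberg
LOCATION 's3://my-bucket/warehouse/db/sample'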

The only restriction for S3 is that you need a metastore to track the table
metadata location because S3 doesn't have a way to implement a metadata
commit. For a metastore, there are implementations backed by the Hive
MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support
for DynamoDB without Glue and JDBC.
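
As a rough sketch, a Glue-backed catalog with an S3 warehouse is configured
with catalog properties like these (names and paths are placeholders, and the
iceberg-aws module plus the AWS SDK bundle need to be on the classpath):

spark.sql.catalog.glue_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_prod.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_prod.warehouse = s3://my-bucket/warehouse
spark.sql.catalog.glue_prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO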

Ryan

On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <ed...@dremio.com>
wrote:

> Lian you can have a look at https://iceberg.apache.org/aws/. It should
> contain all the info that you need. The codebase contains an *S3FileIO* class,
> which is an implementation that is backed by S3.
>
> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com> wrote:
>
>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and
>> wondering if it is possible to create an iceberg table on S3. This guide
>> seems to say only write to a hive table (backed up by HDFS if I understand
>> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
>> can I do it using iceberg? Thanks for any clarification.
>>
>>
>>

-- 
Ryan Blue
Tabular

Re: Writing iceberg table to S3

Posted by Eduard Tudenhoefner <ed...@dremio.com>.
Lian you can have a look at https://iceberg.apache.org/aws/. It should
contain all the info that you need. The codebase contains an *S3FileIO* class,
which is an implementation that is backed by S3.
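
For example, it can be enabled on any catalog through the io-impl property
(a sketch; the catalog name is a placeholder):

spark.sql.catalog.my_catalog.io-impl = org.apache.iceberg.aws.s3.S3FileIO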

On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <ji...@gmail.com> wrote:

> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and
> wondering if it is possible to create an iceberg table on S3. This guide
> seems to say only write to a hive table (backed up by HDFS if I understand
> correctly). Hudi and Delta can write to s3 with a specified S3 path. How
> can I do it using iceberg? Thanks for any clarification.
>
>
>