Posted to dev@iceberg.apache.org by Iceberg Slack Email Digest <ic...@gmail.com> on 2021/08/27 09:00:11 UTC

Apache Iceberg Daily Slack Digest (2021-08-27)

### _#general_

  
 **@sangarshanan:** Hi all! Just wanted to know if it is possible to create
and update Iceberg tables without using Spark for compute, using just the
Java APIs?  
**@dweeks:** Yes, you can use the java APIs directly to update tables. You
still need a catalog implementation for handling the commits, but everything
should be possible through the APIs.  
**@dweeks:** Could you share your use case? Just wondering if you're trying to
simply append existing files or you're trying to integrate something else.  
**@sangarshanan:** my use case is to build a service that can perform CRUD on
Iceberg tables backed by Hive as the catalog and provide an abstraction for
users to run it without the overhead of Spark  
**@sangarshanan:** this is roughly what I came up with but I got hit with some
bugs that I am trying to solve
```
Configuration conf = new Configuration();
conf.set("hive.metastore.uris", "");
Catalog catalog = new HiveCatalog(conf);
TableIdentifier name = TableIdentifier.of("logging", "logs");
Table table = catalog.createTable(name, schema, spec);
```
just wanted to know if I am moving in the right direction since I could not
find any examples that used the Java APIs to directly create and update
Iceberg tables  
**@dweeks:** That all looks right. From the table you should be able to use
the APIs to manipulate the table. Those are the same APIs that
Spark/Trino/Flink use, so you might be able to find examples of the API use
there.  
**@dweeks:** The tests are also a good place to look since they exercise all
of these code paths.  
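
To make that concrete, here is a minimal sketch of the catalog-plus-API flow, assuming Iceberg 0.12 with `iceberg-hive-metastore` and the Hive/Hadoop client jars on the classpath; the metastore URI, warehouse location, schema fields, and data file path are illustrative placeholders rather than values from this thread.
```
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.types.Types;

public class IcebergJavaApiSketch {
  public static void main(String[] args) {
    // Catalog setup: the URI and warehouse location are placeholders.
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(new Configuration());
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://localhost:9083");        // hive.metastore.uris
    props.put("warehouse", "s3://my-bucket/warehouse"); // default table location root
    catalog.initialize("hive", props);

    // Table definition: illustrative schema and partition spec.
    Schema schema = new Schema(
        Types.NestedField.required(1, "level", Types.StringType.get()),
        Types.NestedField.required(2, "message", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("level").build();

    TableIdentifier name = TableIdentifier.of("logging", "logs");
    Table table = catalog.createTable(name, schema, spec);

    // Append an already-written Parquet file; the path, size, and record count are placeholders.
    DataFile file = DataFiles.builder(spec)
        .withPath("s3://my-bucket/warehouse/logging/logs/level=INFO/part-00000.parquet")
        .withFormat(FileFormat.PARQUET)
        .withPartitionPath("level=INFO")
        .withFileSizeInBytes(1024L)
        .withRecordCount(100L)
        .build();
    table.newAppend().appendFile(file).commit();
  }
}
```
The same `Table` handle exposes `newDelete()`, `newOverwrite()`, and `updateSchema()` for the other operations, which is the same surface the engines use.  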
 **@aman.rawat:** Hi everyone, is there any high-level API/abstraction
available to enable row-level deletes as part of delete, update, or merge
operations in table format v2?  
**@russell.spitzer:** Row level deletes are still not supported in the Spark
API so there is no way to enable them there  
**@aman.rawat:** Thanks @russell.spitzer for the update.  
**@gsreeramkumar:** Hi @russell.spitzer, can you please throw some light on:
1. What is the overall state of record-level deletes? Is it supported in any
other engine, or is it just implemented in the core API and waiting for
engine-specific adaptation? 2. Is there a work stream or ongoing work for
implementing this in Spark where we can come and contribute? Truly
appreciate your inputs.  
 **@mohamed.jouini.pro:** Hi everyone. I tried to use Iceberg release 0.12.0
with the DynamoDB catalog, and this is my SparkSession configuration:
```
.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "")
.set("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.set("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
```
and I got this error when creating a table
```
spark.sql("CREATE TABLE iceberg_dynamo_poc.dynamo1 ( \
  id bigint, \
  pathId string, \
  ) \
  USING iceberg ")
```
```
Py4JJavaError: An error occurred while calling o1836.sql.
: java.lang.NoSuchMethodError: org.apache.iceberg.aws.AwsProperties.dynamoDbTableName()Ljava/lang/String;
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.ensureCatalogTableExistsOrCreate(DynamoDbCatalog.java:537)
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.initialize(DynamoDbCatalog.java:133)
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.initialize(DynamoDbCatalog.java:118)
  at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:183)
```
**@russell.spitzer:** Looks like you are missing some things on your
classpath, did you make sure to include iceberg-spark3-runtime and the aws SDK
bundle and connection client?  
**@dweeks:** I suspect that the issue is actually due to not having the aws-
java-sdk-v2 in your classpath  
**@dweeks:** Dynamo (and most of the native AWS support) uses sdk v2, which is
not part of the bundle  
**@russell.spitzer:** oh do we need that in the docs then?  
**@dweeks:** Probably, though maybe we should just bundle it with `iceberg-
aws`?  
**@russell.spitzer:** As long as it isn't versioned with the other aws libs I
think that's fine  
**@russell.spitzer:** I assumed we didn't include the other libs so that patch
releases would be easier to incorporate for end users  
**@blue:** I think the docs cover adding it to the classpath, but it may not
be called out very clearly  
**@russell.spitzer:** This is what the docs list in the startup instructions
```
# add Iceberg dependency
ICEBERG_VERSION=0.12.0
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"

# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
  "bundle"
  "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
  DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done
```
**@russell.spitzer:** bundle, url-connection, spark3-runtime  
**@blue:** Yeah, but does that require reading a shell script or is it stated
that you need bundle and url-connection-client?  
**@russell.spitzer:** oh sorry I thought you meant it listed aws-java-sdk-v2  
**@dweeks:** no, the sdk v2 is `software.amazon.awssdk`  
**@russell.spitzer:** I thought Daniel was noting that a 4th dependency was
also required that is currently not a part of that script  
**@dweeks:** Looks like all that's required is there, but this step may have
been missed or it didn't get into the classpath correctly  
**@russell.spitzer:** ah so it is just those 2 other libs, then yes I agree we
should pull those out and not write this as a shell script  
**@dweeks:** Yeah, even in the examples it requires a pull from maven, which
is not ideal  
**@russell.spitzer:** I still understand though if we want to support changing
the patch version  
**@russell.spitzer:** but in this example i would just enumerate everything
explicitly  
**@blue:** For 3 packages, having a loop doesn't make a ton of sense to me  
**@russell.spitzer:** the loop is just for 2 of them as well
:slightly_smiling_face:  
**@blue:** It's a good script for setting up EMR, but probably better to be
simple for CLI use  
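
As a rough illustration of enumerating everything explicitly, this is one way it could look when building the session programmatically (sketched in Java rather than the docs' shell script); the app name and warehouse path are placeholders, and `spark.jars.packages` only takes effect if it is set before the SparkContext is started.
```
import org.apache.spark.sql.SparkSession;

public class DynamoCatalogSession {
  public static void main(String[] args) {
    // The three coordinates from the docs snippet, spelled out explicitly.
    String packages = String.join(",",
        "org.apache.iceberg:iceberg-spark3-runtime:0.12.0",
        "software.amazon.awssdk:bundle:2.15.40",
        "software.amazon.awssdk:url-connection-client:2.15.40");

    SparkSession spark = SparkSession.builder()
        .appName("iceberg-dynamodb-poc")                                   // placeholder name
        .config("spark.jars.packages", packages)
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl",
            "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
        .config("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "s3://my-bucket/warehouse") // placeholder
        .config("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
        .getOrCreate();

    // Quick sanity check that the catalog is wired up.
    spark.sql("SHOW NAMESPACES IN iceberg_dynamo_poc").show();
  }
}
```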
**@dweeks:** I guess we should probably have @mohamed.jouini.pro verify that
this is the issue though  
**@mohamed.jouini.pro:** Please find the entire `pyspark` code
```
from pyspark.sql import SparkSession

conf = (sc.getConf()
    .set("spark.jars", "iceberg-spark-runtime-0.12.0.jar,bundle-2.15.40.jar,url-connection-client-2.15.40.jar")
    .set("spark.jars.packages", "org.apache.spark:spark-avro_2.12:4.0.0")
    .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .set("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "")
    .set("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
    .set("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
    .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=./log4j.properties")
)
spark.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```
**@blue:** @mohamed.jouini.pro, thanks for that. Have you downloaded the AWS
SDK jars as well? I don't think that Spark will download them for you  
**@russell.spitzer:** Yeah if you set it in "jars" it will not download
dependencies of those libraries  
**@dweeks:** Hmm, so `bundle-2.15.40.jar,url-connection-client-2.15.40.jar`
these are the jars in question.  
**@russell.spitzer:** "Packages" will download the library and all their
dependencies  
**@dweeks:** Since they're alongside the iceberg runtime, maybe they were
already downloaded?  
**@russell.spitzer:** but if they have secondary deps those would be missing,
I also think this is one of those ones that spark only warns on if the jars
are missing  
**@mohamed.jouini.pro:** jar packages are already in EMR local lib path
```
[hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/iceberg*
-rw-r--r-- 1 root root  19201102 Aug 10 15:49 /usr/share/aws/aws-java-sdk/iceberg-spark3-runtime-0.11.1.jar
-rw-r--r-- 1 root root  25809685 Aug 26 13:51 /usr/share/aws/aws-java-sdk/iceberg-spark-runtime-0.12.0.jar
[hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/url-connection*
-rw-r--r-- 1 root root     21027 Aug 10 15:49 /usr/share/aws/aws-java-sdk/url-connection-client-2.15.40.jar
[hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/bundle*
-rw-r--r-- 1 root root 257939967 Aug 10 15:49 /usr/share/aws/aws-java-sdk/bundle-2.15.40.jar
```
**@mohamed.jouini.pro:** I can create a table using the AWS Glue catalog, but
the problem seems to be related to `DynamoDbCatalog`  
**@russell.spitzer:** Neither of those libs has runtime deps according to
Maven, although it is amazing to me that bundle is 246 MB!  
**@mohamed.jouini.pro:** When reading the error message it looks like Iceberg
can't convert the table name to `String` or something like that. The Iceberg
documentation says a default DynamoDB catalog table name (`iceberg`) is used
if it's not set, but I still get the same output  
**@russell.spitzer:** The error message here ```java.lang.NoSuchMethodError:
org.apache.iceberg.aws.AwsProperties.dynamoDbTableName()Ljava/lang/String```  
**@russell.spitzer:** Says it's looking for a method called
dynamoDbTableName() which returns a String and is not finding it on the
classpath  
**@russell.spitzer:** Specifically this one  
**@russell.spitzer:** Is EMR going to put that older "iceberg-spark3-runtime"
jar on the classpath even if you don't add it to jars? Because that could
cause the issue since the method doesn't exist in 0.11.1  
**@russell.spitzer:** ```-rw-r--r-- 1 root root 19201102 Aug 10 15:49
/usr/share/aws/aws-java-sdk/iceberg-spark3-runtime-0.11.1.jar << That one```  
**@russell.spitzer:** Also, shouldn't you be using the 0.12 spark3-runtime,
not the plain spark-runtime?  
**@mohamed.jouini.pro:** @russell.spitzer, the same errors when using spark3
runtime  
**@russell.spitzer:** yes but are you sure the 0.11.1 jar isn't on the
classpath?  
**@russell.spitzer:** That would cause the error  
**@mohamed.jouini.pro:** Ahh, let me remove the old one from aws classpath  
**@mohamed.jouini.pro:** Thx @russell.spitzer, after removing the old runtime
jars I see that Iceberg can create the DynamoDB table, but I can't create a
table
```
spark.sql("CREATE TABLE iceberg_dynamo_poc.iceberg_dynamo_poc.dynamo1 ( \
  id bigint, \
  pathId string, \
  ) \
  USING iceberg ")
```
```
org.apache.iceberg.exceptions.NoSuchNamespaceException: Cannot find default warehouse location: namespace iceberg_dynamo_poc does not exist
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.defaultWarehouseLocation(DynamoDbCatalog.java:158)
  at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:211)
  at org.apache.iceberg.CachingCatalog$CachingTableBuilder.lambda$create$0(CachingCatalog.java:212)
  at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2344)
  at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
```
**@russell.spitzer:**?  
**@mohamed.jouini.pro:** Iceberg has to create the namespace
`iceberg_dynamo_poc` if it doesn't exist, right?  
**@russell.spitzer:** You would need to make it if it doesn't exist  
**@russell.spitzer:** looking briefly at the docs, it seems like they don't
use a namespace in their table references  
**@russell.spitzer:** just catalog.tableName  
**@mohamed.jouini.pro:**
```
spark.sql("CREATE TABLE iceberg_dynamo_poc.dynamo1 ( \
  id bigint, \
  pathId string, \
  ) \
  USING iceberg ")
```
```
Py4JJavaError: An error occurred while calling o554.sql.
: org.apache.iceberg.exceptions.ValidationException: Table namespace must not be empty: dynamo1
```
**@russell.spitzer:** i believe that is because you didn't set a location?
@jackye ^  
**@mohamed.jouini.pro:** I think it should create the namespace if it doesn't exist  
**@mohamed.jouini.pro:** But it should use the `catalog.warehouse` as the
default table location  
**@russell.spitzer:** I don't know how they set this up, but in most systems I
would imagine you would have to first CREATE DATABASE before putting a table
in it, but I don't know how the DynamoCatalog was configured  
**@dweeks:** @russell.spitzer is correct  
**@jackye:** reading the threads now  
**@dweeks:** The namespace needs to exist via a `create database
<catalog>.<database>`  
**@jackye:** +1 for what Daniel says, a database needs to be created before
table creation in Spark.  
**@dweeks:** I think there is some strangeness in spark in terms of catalogs
and databases as well.  
**@jackye:** for ```AWS_PACKAGES=( "bundle" "url-connection-client" )``` the
intention was that `bundle` includes all AWS service packages and works if
you just want to test on EMR with a bootstrap script; for a script suited to
production use, you can change `bundle` to just the list of AWS services you
actually use.  
**@dweeks:** For example, I believe the `use` command can be in context of
catalog or database  
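
To spell out that ordering, here is a minimal sketch assuming the `iceberg_dynamo_poc` catalog is configured as earlier in the thread; the database name `db` is a placeholder.
```
import org.apache.spark.sql.SparkSession;

public class CreateNamespaceThenTable {
  public static void main(String[] args) {
    // Assumes the iceberg_dynamo_poc catalog settings from the session configuration above.
    SparkSession spark = SparkSession.builder().getOrCreate();

    // 1. Create the namespace (database) first; the catalog does not create it implicitly.
    spark.sql("CREATE DATABASE IF NOT EXISTS iceberg_dynamo_poc.db");

    // 2. Then create the table inside that namespace.
    spark.sql("CREATE TABLE IF NOT EXISTS iceberg_dynamo_poc.db.dynamo1 "
        + "(id bigint, pathId string) USING iceberg");

    // 3. Optionally switch context so unqualified names resolve against this catalog/database.
    spark.sql("USE iceberg_dynamo_poc.db");
  }
}
```
With `USE iceberg_dynamo_poc.db` set, subsequent statements can refer to the table simply as `dynamo1`.  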
**@mohamed.jouini.pro:** Thx, it works after creating the database. It's not
the same behavior as with AWS Glue, where I think the namespace is created if
it doesn't exist. Thank you @russell.spitzer @dweeks @jackye @blue  
**@jackye:** I think you might already have a database with that name in Glue
from the past, that’s why you did not need to create it. Otherwise the
behavior should be consistent across all Iceberg catalog implementations  
**@mohamed.jouini.pro:** I will test it again with glue  
 **@gsreeramkumar:** Hello folks! Is there a way to specify while writing a
Row to an Iceberg Table from Spark - *that a specific column is non-existent
for that Row*? My question is NOT about `null` - but about `non-existent`? &
if yes, is there a way to consume this in Spark? Truly appreciate any help!