Posted to dev@iceberg.apache.org by Lian Jiang <ji...@gmail.com> on 2021/08/13 22:49:51 UTC

create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Hi,

I am trying to create an Iceberg table on MinIO S3 with a Hive catalog.

*This is how I launch spark-shell:*

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"

MINIOSERVER=192.168.160.5


# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "bundle"
    "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
    --conf spark.sql.catalog.hive_test.type=hive  \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

*Here is the spark code to create the iceberg table:*

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

val core = "mytable8"
val table = s"hive_test.mydb.${core}"
val s3IcePath = s"s3a://spark-test/${core}.ice"

df.writeTo(table)
    .tableProperty("write.format.default", "parquet")
    .tableProperty("location", s3IcePath)
    .createOrReplace()

I get the error "The AWS Access Key Id you provided does not exist in our
records."

I have verified that I can log in to the MinIO UI using the same username and
password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY environment variables.
https://github.com/apache/iceberg/issues/2168 is related but does not help
me. I am not sure why the credentials do not work for Iceberg + AWS. Any idea,
or an example of writing an Iceberg table to S3 using a Hive catalog, would be
highly appreciated! Thanks.

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
Thanks for clarifying. I will look into your suggestion for using S3FileIO
later.

On Tue, Aug 17, 2021 at 11:40 AM Jack Ye <ye...@gmail.com> wrote:

> Good to hear the issue is fixed!
>
> ACL is optional, as the javadoc says, "If not set, ACL will not be set for
> requests".
>
> But I think to use MinIO you need to use a custom client factory to set
> your S3 endpoint as that MinIO endpoint.
>
> -Jack
>
> On Tue, Aug 17, 2021 at 11:36 AM Lian Jiang <ji...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> S3FileIO need canned ACL according to:
>>
>>   /**
>>    * Used to configure canned access control list (ACL) for S3 client to
>> use during write.
>>    * If not set, ACL will not be set for requests.
>>    * <p>
>>    * The input must be one of {@link
>> software.amazon.awssdk.services.s3.model.ObjectCannedACL},
>>    * such as 'public-read-write'
>>    * For more details:
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
>>    */
>>   public static final String S3FILEIO_ACL = "s3.acl";
>>
>>
>> Minio does not support canned ACL according to
>> https://docs.min.io/docs/minio-server-limits-per-tenant.html:
>>
>> List of Amazon S3 Bucket API's not supported on MinIO
>>
>>    - BucketACL (Use bucket policies
>>    <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
>>    - BucketCORS (CORS enabled by default on all buckets for all HTTP
>>    verbs)
>>    - BucketWebsite (Use caddy <https://github.com/caddyserver/caddy> or
>>    nginx <https://www.nginx.com/resources/wiki/>)
>>    - BucketAnalytics, BucketMetrics, BucketLogging (Use bucket
>>    notification
>>    <https://docs.min.io/docs/minio-client-complete-guide#events> APIs)
>>    - BucketRequestPayment
>>
>> List of Amazon S3 Object API's not supported on MinIO
>>
>>    - ObjectACL (Use bucket policies
>>    <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
>>    - ObjectTorrent
>>
>>
>>
>> Hope this makes sense.
>>
>> BTW, iceberg + Hive + S3A works after Hive using S3A issue has been
>> fixed. Thanks Jack for helping debugging.
>>
>>
>>
>> On Tue, Aug 17, 2021 at 8:38 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I'm not sure that I'm following why MinIO won't work with S3FileIO.
>>> S3FileIO assumes that the credentials are handled by a credentials provider
>>> outside of S3FileIO. How does MinIO handle credentials?
>>>
>>> Ryan
>>>
>>> On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Talked with Lian on Slack, the user is using a hadoop 3.2.1 + hive
>>>> (postgres) + spark + minio docker installation. There might be some S3A
>>>> related dependencies missing on the Hive server side based on the stack
>>>> trace. Let's see if that fixes the issue.
>>>> -Jack
>>>>
>>>> On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> This is my full script launching spark-shell:
>>>>>
>>>>> # add Iceberg dependency
>>>>> export AWS_REGION=us-east-1
>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>
>>>>> ICEBERG_VERSION=0.11.1
>>>>>
>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>
>>>>> MINIOSERVER=192.168.176.5
>>>>>
>>>>>
>>>>> # add AWS dependnecy
>>>>> AWS_SDK_VERSION=2.15.40
>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>> AWS_PACKAGES=(
>>>>>     "bundle"
>>>>>     "url-connection-client"
>>>>> )
>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>> done
>>>>>
>>>>> # start Spark SQL client shell
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf
>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>>
>>>>> Let me know if anything is missing. Thanks.
>>>>>
>>>>> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> Have you included the hadoop-aws jar?
>>>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>>>>> -Jack
>>>>>>
>>>>>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Jack,
>>>>>>>
>>>>>>> You are right. S3FileIO will not work on minio since minio does not
>>>>>>> support ACL:
>>>>>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>>>>>
>>>>>>> To use iceberg, minio + s3a, I used below script to launch
>>>>>>> spark-shell:
>>>>>>>
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>> *    --conf
>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>>>>>>> \*
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf
>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *The spark code:*
>>>>>>>
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>
>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>> import spark.implicits._
>>>>>>> val df = values.toDF()
>>>>>>>
>>>>>>> val core = "mytable"
>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>>>>>
>>>>>>> df.writeTo(table)
>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>>
>>>>>>> *Still the same error:*
>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>>>>
>>>>>>>
>>>>>>> What else could be wrong? Thanks for any clue.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry for the late reply, I thought I replied on Friday but the
>>>>>>>> email did not send successfully.
>>>>>>>>
>>>>>>>> As Daniel said, you don't need to setup S3A if you are using
>>>>>>>> S3FileIO.
>>>>>>>>
>>>>>>>> Th S3FileIO by default reads the default credentials chain to check
>>>>>>>> credential setups one by one:
>>>>>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>>>>>
>>>>>>>> If you would like to use a specialized credential provider, you can
>>>>>>>> directly customize your S3 client:
>>>>>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>>>>>
>>>>>>>> It looks like you are trying to use MinIO to mount S3A file system?
>>>>>>>> If you have to use MinIO then there is not a way to integrate with S3FileIO
>>>>>>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>>>>>>
>>>>>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>>>>>
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf
>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>     --conf
>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you
>>>>>>>>> have a sample using hive catalog, s3FileIO, spark API (as opposed to SQL),
>>>>>>>>> S3 access.key and secret.key? It is hard to get all settings right for this
>>>>>>>>> combination without an example. Appreciate any help.
>>>>>>>>>
>>>>>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <
>>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> So, if I recall correctly, the hive server does need access to
>>>>>>>>>> check and create paths for table locations.
>>>>>>>>>>
>>>>>>>>>> There may be an option to disable this behavior, but otherwise
>>>>>>>>>> the fs implementation probably needs to be available to the hive metastore.
>>>>>>>>>>
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Daniel.
>>>>>>>>>>>
>>>>>>>>>>> After modifying the script to,
>>>>>>>>>>>
>>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>>
>>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>>
>>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>>>>>
>>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>>     "bundle"
>>>>>>>>>>>     "url-connection-client"
>>>>>>>>>>> )
>>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>>> \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>>
>>>>>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I got "ClassNotFoundException: Class
>>>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>>>>>>> could I miss?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Lian,
>>>>>>>>>>>>
>>>>>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>>>>>> expected result.
>>>>>>>>>>>>
>>>>>>>>>>>> When you set: --conf
>>>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>>>>>>> not be used with S3FileIO).
>>>>>>>>>>>>
>>>>>>>>>>>> You might try just removing that line since it should use the
>>>>>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope that's helpful,
>>>>>>>>>>>> -Dan
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <
>>>>>>>>>>>> jiangok2006@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>>>>>
>>>>>>>>>>>>> # add Iceberg dependency
>>>>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>>>>
>>>>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>>>>
>>>>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>>>>>
>>>>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>>>>     "bundle"
>>>>>>>>>>>>>     "url-connection-client"
>>>>>>>>>>>>> )
>>>>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>>>>> done
>>>>>>>>>>>>>
>>>>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>>>>     --conf
>>>>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>>>>     --conf
>>>>>>>>>>>>> spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>>>>     --conf
>>>>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>>>>> \
>>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>>>>     --conf
>>>>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>>>>>
>>>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>>>>>
>>>>>>>>>>>>> val spark =
>>>>>>>>>>>>> SparkSession.builder().master("local").getOrCreate()
>>>>>>>>>>>>> import spark.implicits._
>>>>>>>>>>>>> val df = values.toDF()
>>>>>>>>>>>>>
>>>>>>>>>>>>> val core = "mytable8"
>>>>>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>>>>>
>>>>>>>>>>>>> df.writeTo(table)
>>>>>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>>>>>     .createOrReplace()
>>>>>>>>>>>>>
>>>>>>>>>>>>> I got an error "The AWS Access Key Id you provided does not
>>>>>>>>>>>>> exist in our records.".
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have verified that I can login minio UI using the same
>>>>>>>>>>>>> username and password that I passed to spark-shell via AWS_ACCESS_KEY_ID
>>>>>>>>>>>>> and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but
>>>>>>>>>>>>> does not help me. Not sure why the credential does not work for iceberg +
>>>>>>>>>>>>> AWS. Any idea or an example of writing an iceberg table to S3 using hive
>>>>>>>>>>>>> catalog will be highly appreciated! Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Create your own email signature
>>>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Create your own email signature
>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Create your own email signature
>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>> --
>>
>> Create your own email signature
>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>
>


Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Jack Ye <ye...@gmail.com>.
Good to hear the issue is fixed!

ACL is optional, as the javadoc says, "If not set, ACL will not be set for
requests".

But I think that to use MinIO you need a custom client factory that sets the
S3 endpoint to your MinIO endpoint.

-Jack
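
A rough sketch of what that can look like, for anyone following along: newer
Iceberg releases expose s3.endpoint and s3.path-style-access as catalog
properties, so the endpoint override can be passed straight through spark-shell
(0.11.x may not support these properties, in which case a custom
AwsClientFactory registered via the client.factory property, per
https://iceberg.apache.org/aws/#aws-client-customization, is the route to take).
The com.example.MinioClientFactory class name below is purely illustrative:

/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3://east/warehouse \
    --conf spark.sql.catalog.hive_test.s3.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.sql.catalog.hive_test.s3.path-style-access=true
    # or, with a factory class compiled onto the driver/executor classpath:
    # --conf spark.sql.catalog.hive_test.client.factory=com.example.MinioClientFactory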

On Tue, Aug 17, 2021 at 11:36 AM Lian Jiang <ji...@gmail.com> wrote:

> Hi Ryan,
>
> S3FileIO need canned ACL according to:
>
>   /**
>    * Used to configure canned access control list (ACL) for S3 client to
> use during write.
>    * If not set, ACL will not be set for requests.
>    * <p>
>    * The input must be one of {@link
> software.amazon.awssdk.services.s3.model.ObjectCannedACL},
>    * such as 'public-read-write'
>    * For more details:
> https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
>    */
>   public static final String S3FILEIO_ACL = "s3.acl";
>
>
> Minio does not support canned ACL according to
> https://docs.min.io/docs/minio-server-limits-per-tenant.html:
>
> List of Amazon S3 Bucket API's not supported on MinIO
>
>    - BucketACL (Use bucket policies
>    <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
>    - BucketCORS (CORS enabled by default on all buckets for all HTTP
>    verbs)
>    - BucketWebsite (Use caddy <https://github.com/caddyserver/caddy> or
>    nginx <https://www.nginx.com/resources/wiki/>)
>    - BucketAnalytics, BucketMetrics, BucketLogging (Use bucket
>    notification
>    <https://docs.min.io/docs/minio-client-complete-guide#events> APIs)
>    - BucketRequestPayment
>
> List of Amazon S3 Object API's not supported on MinIO
>
>    - ObjectACL (Use bucket policies
>    <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
>    - ObjectTorrent
>
>
>
> Hope this makes sense.
>
> BTW, iceberg + Hive + S3A works after Hive using S3A issue has been fixed.
> Thanks Jack for helping debugging.
>
>
>
> On Tue, Aug 17, 2021 at 8:38 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> I'm not sure that I'm following why MinIO won't work with S3FileIO.
>> S3FileIO assumes that the credentials are handled by a credentials provider
>> outside of S3FileIO. How does MinIO handle credentials?
>>
>> Ryan
>>
>> On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Talked with Lian on Slack, the user is using a hadoop 3.2.1 + hive
>>> (postgres) + spark + minio docker installation. There might be some S3A
>>> related dependencies missing on the Hive server side based on the stack
>>> trace. Let's see if that fixes the issue.
>>> -Jack
>>>
>>> On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> This is my full script launching spark-shell:
>>>>
>>>> # add Iceberg dependency
>>>> export AWS_REGION=us-east-1
>>>> export AWS_ACCESS_KEY_ID=minio
>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>
>>>> ICEBERG_VERSION=0.11.1
>>>>
>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>
>>>> MINIOSERVER=192.168.176.5
>>>>
>>>>
>>>> # add AWS dependnecy
>>>> AWS_SDK_VERSION=2.15.40
>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>> AWS_PACKAGES=(
>>>>     "bundle"
>>>>     "url-connection-client"
>>>> )
>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>> done
>>>>
>>>> # start Spark SQL client shell
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf
>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>     --conf
>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>     --conf
>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>>
>>>> Let me know if anything is missing. Thanks.
>>>>
>>>> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Have you included the hadoop-aws jar?
>>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>>>> -Jack
>>>>>
>>>>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Jack,
>>>>>>
>>>>>> You are right. S3FileIO will not work on minio since minio does not
>>>>>> support ACL:
>>>>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>>>>
>>>>>> To use iceberg, minio + s3a, I used below script to launch
>>>>>> spark-shell:
>>>>>>
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>> *    --conf
>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>>>>>> \*
>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse
>>>>>> \
>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>     --conf
>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>
>>>>>>
>>>>>>
>>>>>> *The spark code:*
>>>>>>
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>> val values = List(1,2,3,4,5)
>>>>>>
>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>> import spark.implicits._
>>>>>> val df = values.toDF()
>>>>>>
>>>>>> val core = "mytable"
>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>>>>
>>>>>> df.writeTo(table)
>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>     .createOrReplace()
>>>>>>
>>>>>>
>>>>>> *Still the same error:*
>>>>>> java.lang.ClassNotFoundException: Class
>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>>>
>>>>>>
>>>>>> What else could be wrong? Thanks for any clue.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry for the late reply, I thought I replied on Friday but the
>>>>>>> email did not send successfully.
>>>>>>>
>>>>>>> As Daniel said, you don't need to setup S3A if you are using
>>>>>>> S3FileIO.
>>>>>>>
>>>>>>> Th S3FileIO by default reads the default credentials chain to check
>>>>>>> credential setups one by one:
>>>>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>>>>
>>>>>>> If you would like to use a specialized credential provider, you can
>>>>>>> directly customize your S3 client:
>>>>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>>>>
>>>>>>> It looks like you are trying to use MinIO to mount S3A file system?
>>>>>>> If you have to use MinIO then there is not a way to integrate with S3FileIO
>>>>>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>>>>>
>>>>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>>>>
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you
>>>>>>>> have a sample using hive catalog, s3FileIO, spark API (as opposed to SQL),
>>>>>>>> S3 access.key and secret.key? It is hard to get all settings right for this
>>>>>>>> combination without an example. Appreciate any help.
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <
>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So, if I recall correctly, the hive server does need access to
>>>>>>>>> check and create paths for table locations.
>>>>>>>>>
>>>>>>>>> There may be an option to disable this behavior, but otherwise the
>>>>>>>>> fs implementation probably needs to be available to the hive metastore.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Daniel.
>>>>>>>>>>
>>>>>>>>>> After modifying the script to,
>>>>>>>>>>
>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>
>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>
>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>>>>
>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>     "bundle"
>>>>>>>>>>     "url-connection-client"
>>>>>>>>>> )
>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>> \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>
>>>>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I got "ClassNotFoundException: Class
>>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>>>>>> could I miss?
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Lian,
>>>>>>>>>>>
>>>>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>>>>> expected result.
>>>>>>>>>>>
>>>>>>>>>>> When you set: --conf
>>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>>>>>> not be used with S3FileIO).
>>>>>>>>>>>
>>>>>>>>>>> You might try just removing that line since it should use the
>>>>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>>>>
>>>>>>>>>>> Hope that's helpful,
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <
>>>>>>>>>>> jiangok2006@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>>>>>
>>>>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>>>>
>>>>>>>>>>>> # add Iceberg dependency
>>>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>>>
>>>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>>>
>>>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>>>>
>>>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>>>     "bundle"
>>>>>>>>>>>>     "url-connection-client"
>>>>>>>>>>>> )
>>>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>>>> done
>>>>>>>>>>>>
>>>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>>>     --conf
>>>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>>>     --conf
>>>>>>>>>>>> spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>>>     --conf
>>>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>>>> \
>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>>>     --conf
>>>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>>>
>>>>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>>>>
>>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>>>>
>>>>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>>>>> import spark.implicits._
>>>>>>>>>>>> val df = values.toDF()
>>>>>>>>>>>>
>>>>>>>>>>>> val core = "mytable8"
>>>>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>>>>
>>>>>>>>>>>> df.writeTo(table)
>>>>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>>>>     .createOrReplace()
>>>>>>>>>>>>
>>>>>>>>>>>> I got an error "The AWS Access Key Id you provided does not
>>>>>>>>>>>> exist in our records.".
>>>>>>>>>>>>
>>>>>>>>>>>> I have verified that I can login minio UI using the same
>>>>>>>>>>>> username and password that I passed to spark-shell via AWS_ACCESS_KEY_ID
>>>>>>>>>>>> and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but
>>>>>>>>>>>> does not help me. Not sure why the credential does not work for iceberg +
>>>>>>>>>>>> AWS. Any idea or an example of writing an iceberg table to S3 using hive
>>>>>>>>>>>> catalog will be highly appreciated! Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Create your own email signature
>>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Create your own email signature
>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Create your own email signature
>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Create your own email signature
>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>
> --
>
> Create your own email signature
> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
Hi Ryan,

S3FileIO needs a canned ACL, according to:

  /**
   * Used to configure canned access control list (ACL) for S3 client to use during write.
   * If not set, ACL will not be set for requests.
   * <p>
   * The input must be one of {@link software.amazon.awssdk.services.s3.model.ObjectCannedACL},
   * such as 'public-read-write'
   * For more details:
   * https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
   */
  public static final String S3FILEIO_ACL = "s3.acl";


MinIO does not support canned ACLs, according to
https://docs.min.io/docs/minio-server-limits-per-tenant.html:

List of Amazon S3 Bucket API's not supported on MinIO

   - BucketACL (Use bucket policies
   <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
   - BucketCORS (CORS enabled by default on all buckets for all HTTP verbs)
   - BucketWebsite (Use caddy <https://github.com/caddyserver/caddy> or
   nginx <https://www.nginx.com/resources/wiki/>)
   - BucketAnalytics, BucketMetrics, BucketLogging (Use bucket notification
   <https://docs.min.io/docs/minio-client-complete-guide#events> APIs)
   - BucketRequestPayment

List of Amazon S3 Object API's not supported on MinIO

   - ObjectACL (Use bucket policies
   <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
   - ObjectTorrent



Hope this makes sense.

BTW, Iceberg + Hive + S3A works now that the Hive-side S3A issue has been
fixed. Thanks, Jack, for helping debug.
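
For anyone hitting the same ClassNotFoundException, "fixing the Hive side"
generally means giving the metastore the same S3A bits the Spark side already
has. A rough sketch, with placeholder jar versions and paths that depend on
your Hadoop/Hive images:

# put the S3A connector and the matching AWS SDK bundle on the metastore classpath
cp hadoop-aws-3.2.1.jar aws-java-sdk-bundle-*.jar $HIVE_HOME/lib/
# and mirror the S3A settings in the metastore's core-site.xml (or hive-site.xml):
#   fs.s3a.endpoint           http://<minio-host>:9000
#   fs.s3a.access.key         minio
#   fs.s3a.secret.key         minio123
#   fs.s3a.path.style.access  true
#   fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem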



On Tue, Aug 17, 2021 at 8:38 AM Ryan Blue <bl...@tabular.io> wrote:

> I'm not sure that I'm following why MinIO won't work with S3FileIO.
> S3FileIO assumes that the credentials are handled by a credentials provider
> outside of S3FileIO. How does MinIO handle credentials?
>
> Ryan
>
> On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <ye...@gmail.com> wrote:
>
>> Talked with Lian on Slack, the user is using a hadoop 3.2.1 + hive
>> (postgres) + spark + minio docker installation. There might be some S3A
>> related dependencies missing on the Hive server side based on the stack
>> trace. Let's see if that fixes the issue.
>> -Jack
>>
>> On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> This is my full script launching spark-shell:
>>>
>>> # add Iceberg dependency
>>> export AWS_REGION=us-east-1
>>> export AWS_ACCESS_KEY_ID=minio
>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>
>>> ICEBERG_VERSION=0.11.1
>>>
>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>
>>> MINIOSERVER=192.168.176.5
>>>
>>>
>>> # add AWS dependnecy
>>> AWS_SDK_VERSION=2.15.40
>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>> AWS_PACKAGES=(
>>>     "bundle"
>>>     "url-connection-client"
>>> )
>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>> done
>>>
>>> # start Spark SQL client shell
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf
>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>     --conf
>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf
>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>>
>>> Let me know if anything is missing. Thanks.
>>>
>>> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Have you included the hadoop-aws jar?
>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>>> -Jack
>>>>
>>>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Jack,
>>>>>
>>>>> You are right. S3FileIO will not work on minio since minio does not
>>>>> support ACL:
>>>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>>>
>>>>> To use iceberg, minio + s3a, I used below script to launch spark-shell:
>>>>>
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>> *    --conf
>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>>>>> \*
>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf
>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>>
>>>>>
>>>>> *The spark code:*
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>> val values = List(1,2,3,4,5)
>>>>>
>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>> val df = values.toDF()
>>>>>
>>>>> val core = "mytable"
>>>>> val table = s"hive_test.mydb.${core}"
>>>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>>>
>>>>> df.writeTo(table)
>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>     .tableProperty("location", s3IcePath)
>>>>>     .createOrReplace()
>>>>>
>>>>>
>>>>> *Still the same error:*
>>>>> java.lang.ClassNotFoundException: Class
>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>>
>>>>>
>>>>> What else could be wrong? Thanks for any clue.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>>>> did not send successfully.
>>>>>>
>>>>>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>>>>>
>>>>>> Th S3FileIO by default reads the default credentials chain to check
>>>>>> credential setups one by one:
>>>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>>>
>>>>>> If you would like to use a specialized credential provider, you can
>>>>>> directly customize your S3 client:
>>>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>>>
>>>>>> It looks like you are trying to use MinIO to mount S3A file system?
>>>>>> If you have to use MinIO then there is not a way to integrate with S3FileIO
>>>>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>>>>
>>>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>>>
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you
>>>>>>> have a sample using hive catalog, s3FileIO, spark API (as opposed to SQL),
>>>>>>> S3 access.key and secret.key? It is hard to get all settings right for this
>>>>>>> combination without an example. Appreciate any help.
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <
>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>
>>>>>>>> So, if I recall correctly, the hive server does need access to
>>>>>>>> check and create paths for table locations.
>>>>>>>>
>>>>>>>> There may be an option to disable this behavior, but otherwise the
>>>>>>>> fs implementation probably needs to be available to the hive metastore.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks Daniel.
>>>>>>>>>
>>>>>>>>> After modifying the script to,
>>>>>>>>>
>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>
>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>
>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>>>
>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # add AWS dependnecy
>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>     "bundle"
>>>>>>>>>     "url-connection-client"
>>>>>>>>> )
>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> # start Spark SQL client shell
>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>     --conf
>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>     --conf
>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>
>>>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I got "ClassNotFoundException: Class
>>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>>>>> could I miss?
>>>>>>>>>
>>>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Lian,
>>>>>>>>>>
>>>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>>>> expected result.
>>>>>>>>>>
>>>>>>>>>> When you set: --conf
>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>>>>> not be used with S3FileIO).
>>>>>>>>>>
>>>>>>>>>> You might try just removing that line since it should use the
>>>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>>>
>>>>>>>>>> Hope that's helpful,
>>>>>>>>>> -Dan
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>>>>
>>>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>>>
>>>>>>>>>>> # add Iceberg dependency
>>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>>
>>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>>
>>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>>>
>>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>>     "bundle"
>>>>>>>>>>>     "url-connection-client"
>>>>>>>>>>> )
>>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>>> \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>>     --conf
>>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>>
>>>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>>>
>>>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>>>> import spark.implicits._
>>>>>>>>>>> val df = values.toDF()
>>>>>>>>>>>
>>>>>>>>>>> val core = "mytable8"
>>>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>>>
>>>>>>>>>>> df.writeTo(table)
>>>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>>>     .createOrReplace()
>>>>>>>>>>>
>>>>>>>>>>> I got an error "The AWS Access Key Id you provided does not
>>>>>>>>>>> exist in our records.".
>>>>>>>>>>>
>>>>>>>>>>> I have verified that I can login minio UI using the same
>>>>>>>>>>> username and password that I passed to spark-shell via AWS_ACCESS_KEY_ID
>>>>>>>>>>> and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but
>>>>>>>>>>> does not help me. Not sure why the credential does not work for iceberg +
>>>>>>>>>>> AWS. Any idea or an example of writing an iceberg table to S3 using hive
>>>>>>>>>>> catalog will be highly appreciated! Thanks.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Create your own email signature
>>>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Create your own email signature
>>>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Create your own email signature
>>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>>>
>>>>
>>>
>>> --
>>>
>>> Create your own email signature
>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>
>>>
>>
>
> --
> Ryan Blue
> Tabular
>



Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Ryan Blue <bl...@tabular.io>.
I'm not sure that I'm following why MinIO won't work with S3FileIO.
S3FileIO assumes that the credentials are handled by a credentials provider
outside of S3FileIO. How does MinIO handle credentials?

Ryan
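
For context on the credentials question: MinIO itself only checks a static
access key and secret key. On the client side those can reach S3FileIO through
the AWS SDK default credentials chain, for example the environment variables
already exported in the original script:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

What the chain does not cover is the endpoint, which still has to be redirected
from AWS S3 to the MinIO server, as in the sketch after Jack's reply above.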

On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <ye...@gmail.com> wrote:

> Talked with Lian on Slack, the user is using a hadoop 3.2.1 + hive
> (postgres) + spark + minio docker installation. There might be some S3A
> related dependencies missing on the Hive server side based on the stack
> trace. Let's see if that fixes the issue.
> -Jack
>
> On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> This is my full script launching spark-shell:
>>
>> # add Iceberg dependency
>> export AWS_REGION=us-east-1
>> export AWS_ACCESS_KEY_ID=minio
>> export AWS_SECRET_ACCESS_KEY=minio123
>>
>> ICEBERG_VERSION=0.11.1
>>
>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>
>> MINIOSERVER=192.168.176.5
>>
>>
>> # add AWS dependnecy
>> AWS_SDK_VERSION=2.15.40
>> AWS_MAVEN_GROUP=software.amazon.awssdk
>> AWS_PACKAGES=(
>>     "bundle"
>>     "url-connection-client"
>> )
>> for pkg in "${AWS_PACKAGES[@]}"; do
>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>> done
>>
>> # start Spark SQL client shell
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf
>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>     --conf
>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>>
>> Let me know if anything is missing. Thanks.
>>
>> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Have you included the hadoop-aws jar?
>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>> -Jack
>>>
>>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Jack,
>>>>
>>>> You are right. S3FileIO will not work on minio since minio does not
>>>> support ACL:
>>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>>
>>>> To use iceberg, minio + s3a, I used below script to launch spark-shell:
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf
>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>> *    --conf
>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>>>> \*
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>     --conf
>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>>
>>>>
>>>> *The spark code:*
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>> val values = List(1,2,3,4,5)
>>>>
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val core = "mytable"
>>>> val table = s"hive_test.mydb.${core}"
>>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>>
>>>> df.writeTo(table)
>>>>     .tableProperty("write.format.default", "parquet")
>>>>     .tableProperty("location", s3IcePath)
>>>>     .createOrReplace()
>>>>
>>>>
>>>> *Still the same error:*
>>>> java.lang.ClassNotFoundException: Class
>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>
>>>>
>>>> What else could be wrong? Thanks for any clue.
>>>>
>>>>
>>>>
>>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>>> did not send successfully.
>>>>>
>>>>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>>>>
>>>>> Th S3FileIO by default reads the default credentials chain to check
>>>>> credential setups one by one:
>>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>>
>>>>> If you would like to use a specialized credential provider, you can
>>>>> directly customize your S3 client:
>>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>>
>>>>> It looks like you are trying to use MinIO to mount S3A file system? If
>>>>> you have to use MinIO then there is not a way to integrate with S3FileIO
>>>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>>>
>>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>>
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you
>>>>>> have a sample using hive catalog, s3FileIO, spark API (as opposed to SQL),
>>>>>> S3 access.key and secret.key? It is hard to get all settings right for this
>>>>>> combination without an example. Appreciate any help.
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <
>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>
>>>>>>> So, if I recall correctly, the hive server does need access to check
>>>>>>> and create paths for table locations.
>>>>>>>
>>>>>>> There may be an option to disable this behavior, but otherwise the
>>>>>>> fs implementation probably needs to be available to the hive metastore.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Daniel.
>>>>>>>>
>>>>>>>> After modifying the script to,
>>>>>>>>
>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>
>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>
>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>>
>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>
>>>>>>>>
>>>>>>>> # add AWS dependnecy
>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>> AWS_PACKAGES=(
>>>>>>>>     "bundle"
>>>>>>>>     "url-connection-client"
>>>>>>>> )
>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>> done
>>>>>>>>
>>>>>>>> # start Spark SQL client shell
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf
>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>     --conf
>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>
>>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>> I got "ClassNotFoundException: Class
>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>>>> could I miss?
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Lian,
>>>>>>>>>
>>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>>> expected result.
>>>>>>>>>
>>>>>>>>> When you set: --conf
>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>>>> not be used with S3FileIO).
>>>>>>>>>
>>>>>>>>> You might try just removing that line since it should use the
>>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>>
>>>>>>>>> Hope that's helpful,
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>>>
>>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>>
>>>>>>>>>> # add Iceberg dependency
>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>
>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>
>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>>
>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # add AWS dependnecy
>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>     "bundle"
>>>>>>>>>>     "url-connection-client"
>>>>>>>>>> )
>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000
>>>>>>>>>> \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>     --conf
>>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>
>>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>>
>>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>>> import spark.implicits._
>>>>>>>>>> val df = values.toDF()
>>>>>>>>>>
>>>>>>>>>> val core = "mytable8"
>>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>>
>>>>>>>>>> df.writeTo(table)
>>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>>     .createOrReplace()
>>>>>>>>>>
>>>>>>>>>> I got an error "The AWS Access Key Id you provided does not exist
>>>>>>>>>> in our records.".
>>>>>>>>>>
>>>>>>>>>> I have verified that I can login minio UI using the same username
>>>>>>>>>> and password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but
>>>>>>>>>> does not help me. Not sure why the credential does not work for iceberg +
>>>>>>>>>> AWS. Any idea or an example of writing an iceberg table to S3 using hive
>>>>>>>>>> catalog will be highly appreciated! Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

-- 
Ryan Blue
Tabular

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Jack Ye <ye...@gmail.com>.
Talked with Lian on Slack; the user is running a Hadoop 3.2.1 + Hive
(Postgres) + Spark + MinIO Docker installation. Based on the stack trace,
some S3A-related dependencies appear to be missing on the Hive server side.
Let's see if adding them fixes the issue.
-Jack
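
A minimal sketch of what that could look like on the Hive metastore side,
assuming a stock Hadoop 3.2.1 layout (the paths and jar versions below are
assumptions, not something confirmed in this thread):

# make the S3A classes visible to the Hive metastore JVM by copying the jars
# that ship with Hadoop into Hive's lib directory, then restart the metastore
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar      $HIVE_HOME/lib/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-*.jar $HIVE_HOME/lib/
hive --service metastore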

On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <ji...@gmail.com> wrote:

> This is my full script launching spark-shell:
>
> # add Iceberg dependency
> export AWS_REGION=us-east-1
> export AWS_ACCESS_KEY_ID=minio
> export AWS_SECRET_ACCESS_KEY=minio123
>
> ICEBERG_VERSION=0.11.1
>
> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>
> MINIOSERVER=192.168.176.5
>
>
> # add AWS dependnecy
> AWS_SDK_VERSION=2.15.40
> AWS_MAVEN_GROUP=software.amazon.awssdk
> AWS_PACKAGES=(
>     "bundle"
>     "url-connection-client"
> )
> for pkg in "${AWS_PACKAGES[@]}"; do
>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
> done
>
> # start Spark SQL client shell
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf
> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive  \
>     --conf
> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
>
> Let me know if anything is missing. Thanks.
>
> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:
>
>> Have you included the hadoop-aws jar?
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>> -Jack
>>
>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Jack,
>>>
>>> You are right. S3FileIO will not work on minio since minio does not
>>> support ACL:
>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>
>>> To use iceberg, minio + s3a, I used below script to launch spark-shell:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf
>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>> *    --conf
>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>>> \*
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf
>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>>
>>>
>>> *The spark code:*
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1,2,3,4,5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val core = "mytable"
>>> val table = s"hive_test.mydb.${core}"
>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>
>>> df.writeTo(table)
>>>     .tableProperty("write.format.default", "parquet")
>>>     .tableProperty("location", s3IcePath)
>>>     .createOrReplace()
>>>
>>>
>>> *Still the same error:*
>>> java.lang.ClassNotFoundException: Class
>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>
>>>
>>> What else could be wrong? Thanks for any clue.
>>>
>>>
>>>
>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>> did not send successfully.
>>>>
>>>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>>>
>>>> Th S3FileIO by default reads the default credentials chain to check
>>>> credential setups one by one:
>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>
>>>> If you would like to use a specialized credential provider, you can
>>>> directly customize your S3 client:
>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>
>>>> It looks like you are trying to use MinIO to mount S3A file system? If
>>>> you have to use MinIO then there is not a way to integrate with S3FileIO
>>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>>
>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf
>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>     --conf
>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>>
>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have
>>>>> a sample using hive catalog, s3FileIO, spark API (as opposed to SQL), S3
>>>>> access.key and secret.key? It is hard to get all settings right for this
>>>>> combination without an example. Appreciate any help.
>>>>>
>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> So, if I recall correctly, the hive server does need access to check
>>>>>> and create paths for table locations.
>>>>>>
>>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>>> implementation probably needs to be available to the hive metastore.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Daniel.
>>>>>>>
>>>>>>> After modifying the script to,
>>>>>>>
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>>
>>>>>>> # add AWS dependnecy
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf
>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>>
>>>>>>>
>>>>>>> I got "ClassNotFoundException: Class
>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>>> could I miss?
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Lian,
>>>>>>>>
>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>> expected result.
>>>>>>>>
>>>>>>>> When you set: --conf
>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>>> not be used with S3FileIO).
>>>>>>>>
>>>>>>>> You might try just removing that line since it should use the
>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>
>>>>>>>> Hope that's helpful,
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>>
>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>
>>>>>>>>> # add Iceberg dependency
>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>
>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>
>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>
>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # add AWS dependnecy
>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>     "bundle"
>>>>>>>>>     "url-connection-client"
>>>>>>>>> )
>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> # start Spark SQL client shell
>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>     --conf
>>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix
>>>>>>>>> \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>>     --conf
>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>     --conf
>>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>
>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>
>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>> import spark.implicits._
>>>>>>>>> val df = values.toDF()
>>>>>>>>>
>>>>>>>>> val core = "mytable8"
>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>
>>>>>>>>> df.writeTo(table)
>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>     .createOrReplace()
>>>>>>>>>
>>>>>>>>> I got an error "The AWS Access Key Id you provided does not exist
>>>>>>>>> in our records.".
>>>>>>>>>
>>>>>>>>> I have verified that I can login minio UI using the same username
>>>>>>>>> and password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does
>>>>>>>>> not help me. Not sure why the credential does not work for iceberg + AWS.
>>>>>>>>> Any idea or an example of writing an iceberg table to S3 using hive catalog
>>>>>>>>> will be highly appreciated! Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
This is my full script for launching spark-shell:

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

MINIOSERVER=192.168.176.5


# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "bundle"
    "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive  \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem


Let me know if anything is missing. Thanks.

On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <ye...@gmail.com> wrote:

> Have you included the hadoop-aws jar?
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
> -Jack
>
> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Jack,
>>
>> You are right. S3FileIO will not work on minio since minio does not
>> support ACL: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>
>> To use iceberg, minio + s3a, I used below script to launch spark-shell:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf
>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive  \
>> *    --conf
>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
>> \*
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>>
>>
>> *The spark code:*
>>
>> import org.apache.spark.sql.SparkSession
>> val values = List(1,2,3,4,5)
>>
>> val spark = SparkSession.builder().master("local").getOrCreate()
>> import spark.implicits._
>> val df = values.toDF()
>>
>> val core = "mytable"
>> val table = s"hive_test.mydb.${core}"
>> val s3IcePath = s"s3a://east/${core}.ice"
>>
>> df.writeTo(table)
>>     .tableProperty("write.format.default", "parquet")
>>     .tableProperty("location", s3IcePath)
>>     .createOrReplace()
>>
>>
>> *Still the same error:*
>> java.lang.ClassNotFoundException: Class
>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>
>>
>> What else could be wrong? Thanks for any clue.
>>
>>
>>
>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Sorry for the late reply, I thought I replied on Friday but the email
>>> did not send successfully.
>>>
>>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>>
>>> Th S3FileIO by default reads the default credentials chain to check
>>> credential setups one by one:
>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>
>>> If you would like to use a specialized credential provider, you can
>>> directly customize your S3 client:
>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>
>>> It looks like you are trying to use MinIO to mount S3A file system? If
>>> you have to use MinIO then there is not a way to integrate with S3FileIO
>>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>>
>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf
>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>     --conf
>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have
>>>> a sample using hive catalog, s3FileIO, spark API (as opposed to SQL), S3
>>>> access.key and secret.key? It is hard to get all settings right for this
>>>> combination without an example. Appreciate any help.
>>>>
>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> So, if I recall correctly, the hive server does need access to check
>>>>> and create paths for table locations.
>>>>>
>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>> implementation probably needs to be available to the hive metastore.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Daniel.
>>>>>>
>>>>>> After modifying the script to,
>>>>>>
>>>>>> export AWS_REGION=us-east-1
>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>
>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>
>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>
>>>>>> MINIOSERVER=192.168.160.5
>>>>>>
>>>>>>
>>>>>> # add AWS dependnecy
>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>> AWS_PACKAGES=(
>>>>>>     "bundle"
>>>>>>     "url-connection-client"
>>>>>> )
>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>> done
>>>>>>
>>>>>> # start Spark SQL client shell
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>     --conf
>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>
>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>> java.lang.ClassNotFoundException: Class
>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>
>>>>>>
>>>>>> I got "ClassNotFoundException: Class
>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>>> could I miss?
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>> daniel.c.weeks@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Lian,
>>>>>>>
>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>> expected result.
>>>>>>>
>>>>>>> When you set: --conf
>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>>> not be used with S3FileIO).
>>>>>>>
>>>>>>> You might try just removing that line since it should use the
>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>
>>>>>>> Hope that's helpful,
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>>
>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>
>>>>>>>> # add Iceberg dependency
>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>
>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>
>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>
>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>
>>>>>>>>
>>>>>>>> # add AWS dependnecy
>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>> AWS_PACKAGES=(
>>>>>>>>     "bundle"
>>>>>>>>     "url-connection-client"
>>>>>>>> )
>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>> done
>>>>>>>>
>>>>>>>> # start Spark SQL client shell
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf
>>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>>     --conf
>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>     --conf
>>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>
>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>
>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>> import spark.implicits._
>>>>>>>> val df = values.toDF()
>>>>>>>>
>>>>>>>> val core = "mytable8"
>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>
>>>>>>>> df.writeTo(table)
>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>     .createOrReplace()
>>>>>>>>
>>>>>>>> I got an error "The AWS Access Key Id you provided does not exist
>>>>>>>> in our records.".
>>>>>>>>
>>>>>>>> I have verified that I can login minio UI using the same username
>>>>>>>> and password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does
>>>>>>>> not help me. Not sure why the credential does not work for iceberg + AWS.
>>>>>>>> Any idea or an example of writing an iceberg table to S3 using hive catalog
>>>>>>>> will be highly appreciated! Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Jack Ye <ye...@gmail.com>.
Have you included the hadoop-aws jar?
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
-Jack
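
One way to pull it in is to append it to the --packages list used elsewhere in
this thread (the 3.2.0 version here is an assumption; it should match the Hadoop
build on the cluster):

# hadoop-aws provides org.apache.hadoop.fs.s3a.S3AFileSystem and brings in the
# AWS SDK bundle it needs
DEPENDENCIES+=",org.apache.hadoop:hadoop-aws:3.2.0"
/spark/bin/spark-shell --packages $DEPENDENCIES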

On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <ji...@gmail.com> wrote:

> Jack,
>
> You are right. S3FileIO will not work on minio since minio does not
> support ACL: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>
> To use iceberg, minio + s3a, I used below script to launch spark-shell:
>
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf
> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive  \
> *    --conf
> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
> \*
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
>
>
> *The spark code:*
>
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val core = "mytable"
> val table = s"hive_test.mydb.${core}"
> val s3IcePath = s"s3a://east/${core}.ice"
>
> df.writeTo(table)
>     .tableProperty("write.format.default", "parquet")
>     .tableProperty("location", s3IcePath)
>     .createOrReplace()
>
>
> *Still the same error:*
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>
>
> What else could be wrong? Thanks for any clue.
>
>
>
> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:
>
>> Sorry for the late reply, I thought I replied on Friday but the email did
>> not send successfully.
>>
>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>
>> Th S3FileIO by default reads the default credentials chain to check
>> credential setups one by one:
>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>
>> If you would like to use a specialized credential provider, you can
>> directly customize your S3 client:
>> https://iceberg.apache.org/aws/#aws-client-customization
>>
>> It looks like you are trying to use MinIO to mount S3A file system? If
>> you have to use MinIO then there is not a way to integrate with S3FileIO
>> right now. (maybe I am wrong on this, I don't know much about MinIO)
>>
>> To directly use S3FileIO with HiveCatalog, simply do:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf
>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>     --conf
>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>
>> Best,
>> Jack Ye
>>
>>
>>
>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have a
>>> sample using hive catalog, s3FileIO, spark API (as opposed to SQL), S3
>>> access.key and secret.key? It is hard to get all settings right for this
>>> combination without an example. Appreciate any help.
>>>
>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
>>> wrote:
>>>
>>>> So, if I recall correctly, the hive server does need access to check
>>>> and create paths for table locations.
>>>>
>>>> There may be an option to disable this behavior, but otherwise the fs
>>>> implementation probably needs to be available to the hive metastore.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com> wrote:
>>>>
>>>>> Thanks Daniel.
>>>>>
>>>>> After modifying the script to,
>>>>>
>>>>> export AWS_REGION=us-east-1
>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>
>>>>> ICEBERG_VERSION=0.11.1
>>>>>
>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>
>>>>> MINIOSERVER=192.168.160.5
>>>>>
>>>>>
>>>>> # add AWS dependnecy
>>>>> AWS_SDK_VERSION=2.15.40
>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>> AWS_PACKAGES=(
>>>>>     "bundle"
>>>>>     "url-connection-client"
>>>>> )
>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>> done
>>>>>
>>>>> # start Spark SQL client shell
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf
>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>> java.lang.ClassNotFoundException: Class
>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>>
>>>>>
>>>>> I got "ClassNotFoundException: Class
>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>>> could I miss?
>>>>>
>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Lian,
>>>>>>
>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>> expected result.
>>>>>>
>>>>>> When you set: --conf
>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>>> not be used with S3FileIO).
>>>>>>
>>>>>> You might try just removing that line since it should use the
>>>>>> HadoopFileIO at that point and may work.
>>>>>>
>>>>>> Hope that's helpful,
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>>
>>>>>>> *This is how I launch spark-shell:*
>>>>>>>
>>>>>>> # add Iceberg dependency
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>>
>>>>>>> # add AWS dependnecy
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>>     --conf
>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf
>>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>
>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>> import spark.implicits._
>>>>>>> val df = values.toDF()
>>>>>>>
>>>>>>> val core = "mytable8"
>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>
>>>>>>> df.writeTo(table)
>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>> I got an error "The AWS Access Key Id you provided does not exist in
>>>>>>> our records.".
>>>>>>>
>>>>>>> I have verified that I can login minio UI using the same username
>>>>>>> and password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does
>>>>>>> not help me. Not sure why the credential does not work for iceberg + AWS.
>>>>>>> Any idea or an example of writing an iceberg table to S3 using hive catalog
>>>>>>> will be highly appreciated! Thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
Jack,

You are right. S3FileIO will not work on minio since minio does not support
ACL: https://docs.min.io/docs/minio-server-limits-per-tenant.html

To use Iceberg with minio + s3a, I used the script below to launch spark-shell:

/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive  \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem



*The spark code:*

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

val core = "mytable"
val table = s"hive_test.mydb.${core}"
val s3IcePath = s"s3a://east/${core}.ice"

df.writeTo(table)
    .tableProperty("write.format.default", "parquet")
    .tableProperty("location", s3IcePath)
    .createOrReplace()


*Still the same error:*
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found


What else could be wrong? Thanks for any clue.



On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <ye...@gmail.com> wrote:

> Sorry for the late reply, I thought I replied on Friday but the email did
> not send successfully.
>
> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>
> Th S3FileIO by default reads the default credentials chain to check
> credential setups one by one:
> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>
> If you would like to use a specialized credential provider, you can
> directly customize your S3 client:
> https://iceberg.apache.org/aws/#aws-client-customization
>
> It looks like you are trying to use MinIO to mount S3A file system? If you
> have to use MinIO then there is not a way to integrate with S3FileIO right
> now. (maybe I am wrong on this, I don't know much about MinIO)
>
> To directly use S3FileIO with HiveCatalog, simply do:
>
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf
> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive  \
>     --conf
> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>
> Best,
> Jack Ye
>
>
>
> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have a
>> sample using hive catalog, s3FileIO, spark API (as opposed to SQL), S3
>> access.key and secret.key? It is hard to get all settings right for this
>> combination without an example. Appreciate any help.
>>
>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
>> wrote:
>>
>>> So, if I recall correctly, the hive server does need access to check and
>>> create paths for table locations.
>>>
>>> There may be an option to disable this behavior, but otherwise the fs
>>> implementation probably needs to be available to the hive metastore.
>>>
>>> -Dan
>>>
>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com> wrote:
>>>
>>>> Thanks Daniel.
>>>>
>>>> After modifying the script to,
>>>>
>>>> export AWS_REGION=us-east-1
>>>> export AWS_ACCESS_KEY_ID=minio
>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>
>>>> ICEBERG_VERSION=0.11.1
>>>>
>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>
>>>> MINIOSERVER=192.168.160.5
>>>>
>>>>
>>>> # add AWS dependnecy
>>>> AWS_SDK_VERSION=2.15.40
>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>> AWS_PACKAGES=(
>>>>     "bundle"
>>>>     "url-connection-client"
>>>> )
>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>> done
>>>>
>>>> # start Spark SQL client shell
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf
>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>     --conf
>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>> I got: MetaException: java.lang.RuntimeException:
>>>> java.lang.ClassNotFoundException: Class
>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>>
>>>>
>>>> I got "ClassNotFoundException: Class
>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>>> could I miss?
>>>>
>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Lian,
>>>>>
>>>>> At a cursory glance, it appears that you might be mixing two different
>>>>> FileIO implementations, which may be why you are not getting the expected
>>>>> result.
>>>>>
>>>>> When you set: --conf
>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>>> not be used with S3FileIO).
>>>>>
>>>>> You might try just removing that line since it should use the
>>>>> HadoopFileIO at that point and may work.
>>>>>
>>>>> Hope that's helpful,
>>>>> -Dan
>>>>>
>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>>
>>>>>> *This is how I launch spark-shell:*
>>>>>>
>>>>>> # add Iceberg dependency
>>>>>> export AWS_REGION=us-east-1
>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>
>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>
>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>
>>>>>> MINIOSERVER=192.168.160.5
>>>>>>
>>>>>>
>>>>>> # add AWS dependnecy
>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>> AWS_PACKAGES=(
>>>>>>     "bundle"
>>>>>>     "url-connection-client"
>>>>>> )
>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>> done
>>>>>>
>>>>>> # start Spark SQL client shell
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>>     --conf
>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>     --conf
>>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>
>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>> val values = List(1,2,3,4,5)
>>>>>>
>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>> import spark.implicits._
>>>>>> val df = values.toDF()
>>>>>>
>>>>>> val core = "mytable8"
>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>
>>>>>> df.writeTo(table)
>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>     .createOrReplace()
>>>>>>
>>>>>> I got an error "The AWS Access Key Id you provided does not exist in
>>>>>> our records.".
>>>>>>
>>>>>> I have verified that I can login minio UI using the same username and
>>>>>> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does
>>>>>> not help me. Not sure why the credential does not work for iceberg + AWS.
>>>>>> Any idea or an example of writing an iceberg table to S3 using hive catalog
>>>>>> will be highly appreciated! Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>


Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Jack Ye <ye...@gmail.com>.
Sorry for the late reply; I thought I had replied on Friday, but the email did
not send successfully.

As Daniel said, you don't need to setup S3A if you are using S3FileIO.

The S3FileIO by default reads the default credentials chain to check
credential setups one by one:
https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
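
For example, one link in that chain is the shared credentials file, which could
be populated with the same MinIO credentials used in the scripts in this thread
(a sketch; environment variables, as already exported in those scripts, work
just as well):

# ~/.aws/credentials is one of the providers the default chain checks
mkdir -p ~/.aws
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = minio
aws_secret_access_key = minio123
EOF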

If you would like to use a specialized credential provider, you can
directly customize your S3 client:
https://iceberg.apache.org/aws/#aws-client-customization

It looks like you are trying to use MinIO to mount an S3A file system? If you
have to use MinIO, then there is not a way to integrate with S3FileIO right now
(maybe I am wrong on this, I don't know much about MinIO).
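
If that client customization hook does apply here, a minimal sketch (in Scala)
of the kind of SDK v2 S3 client a custom factory could return for a MinIO
endpoint is shown below; the endpoint, credentials and path-style setting come
from the scripts in this thread, while the factory wiring itself (class name,
catalog property, exact interface) is an assumption that depends on the Iceberg
version:

import java.net.URI
import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.{S3Client, S3Configuration}

// S3 client pointed at the MinIO endpoint, with path-style access and the
// static MinIO credentials used elsewhere in this thread
val minioS3: S3Client = S3Client.builder()
  .endpointOverride(URI.create("http://192.168.160.5:9000"))
  .region(Region.US_EAST_1)
  .credentialsProvider(StaticCredentialsProvider.create(
    AwsBasicCredentials.create("minio", "minio123")))
  .serviceConfiguration(S3Configuration.builder().pathStyleAccessEnabled(true).build())
  .build()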

To directly use S3FileIO with HiveCatalog, simply do:

/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive  \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3://bucket

Best,
Jack Ye



On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <ji...@gmail.com> wrote:

> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have a
> sample using hive catalog, s3FileIO, spark API (as opposed to SQL), S3
> access.key and secret.key? It is hard to get all settings right for this
> combination without an example. Appreciate any help.
>
> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
> wrote:
>
>> So, if I recall correctly, the hive server does need access to check and
>> create paths for table locations.
>>
>> There may be an option to disable this behavior, but otherwise the fs
>> implementation probably needs to be available to the hive metastore.
>>
>> -Dan
>>
>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Thanks Daniel.
>>>
>>> After modifying the script to,
>>>
>>> export AWS_REGION=us-east-1
>>> export AWS_ACCESS_KEY_ID=minio
>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>
>>> ICEBERG_VERSION=0.11.1
>>>
>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>
>>> MINIOSERVER=192.168.160.5
>>>
>>>
>>> # add AWS dependnecy
>>> AWS_SDK_VERSION=2.15.40
>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>> AWS_PACKAGES=(
>>>     "bundle"
>>>     "url-connection-client"
>>> )
>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>> done
>>>
>>> # start Spark SQL client shell
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf
>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf
>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>> I got: MetaException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class
>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>>> using s3 and should not cause this error. Any ideas? Thanks.
>>>
>>>
>>> I got "ClassNotFoundException: Class
>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>>> could I miss?
>>>
>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
>>> wrote:
>>>
>>>> Hey Lian,
>>>>
>>>> At a cursory glance, it appears that you might be mixing two different
>>>> FileIO implementations, which may be why you are not getting the expected
>>>> result.
>>>>
>>>> When you set: --conf
>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>>> actually switching over to the native S3 implementation within Iceberg (as
>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>>> settings to setup access are then set for the S3AFileSystem (which would
>>>> not be used with S3FileIO).
>>>>
>>>> You might try just removing that line since it should use the
>>>> HadoopFileIO at that point and may work.
>>>>
>>>> Hope that's helpful,
>>>> -Dan
>>>>
>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I try to create an iceberg table on minio s3 and hive.
>>>>>
>>>>> *This is how I launch spark-shell:*
>>>>>
>>>>> # add Iceberg dependency
>>>>> export AWS_REGION=us-east-1
>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>
>>>>> ICEBERG_VERSION=0.11.1
>>>>>
>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>
>>>>> MINIOSERVER=192.168.160.5
>>>>>
>>>>>
>>>>> # add AWS dependnecy
>>>>> AWS_SDK_VERSION=2.15.40
>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>> AWS_PACKAGES=(
>>>>>     "bundle"
>>>>>     "url-connection-client"
>>>>> )
>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>> done
>>>>>
>>>>> # start Spark SQL client shell
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>>     --conf
>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf
>>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>> *Here is the spark code to create the iceberg table:*
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>> val values = List(1,2,3,4,5)
>>>>>
>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>> val df = values.toDF()
>>>>>
>>>>> val core = "mytable8"
>>>>> val table = s"hive_test.mydb.${core}"
>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>
>>>>> df.writeTo(table)
>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>     .tableProperty("location", s3IcePath)
>>>>>     .createOrReplace()
>>>>>
>>>>> I got an error "The AWS Access Key Id you provided does not exist in
>>>>> our records.".
>>>>>
>>>>> I have verified that I can login minio UI using the same username and
>>>>> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>>>> help me. Not sure why the credential does not work for iceberg + AWS. Any
>>>>> idea or an example of writing an iceberg table to S3 using hive catalog
>>>>> will be highly appreciated! Thanks.
>>>>>
>>>>>
>>>>>
>>>
>>
>
>

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
Thanks. I prefer S3FileIO since it is the implementation recommended by
Iceberg. Do you have a sample that uses the Hive catalog, S3FileIO, the Spark
API (as opposed to SQL), and S3 access.key and secret.key? It is hard to get
all the settings right for this combination without an example. I appreciate
any help.

On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <da...@gmail.com>
wrote:

> So, if I recall correctly, the hive server does need access to check and
> create paths for table locations.
>
> There may be an option to disable this behavior, but otherwise the fs
> implementation probably needs to be available to the hive metastore.
>
> -Dan
>
> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Thanks Daniel.
>>
>> After modifying the script to,
>>
>> export AWS_REGION=us-east-1
>> export AWS_ACCESS_KEY_ID=minio
>> export AWS_SECRET_ACCESS_KEY=minio123
>>
>> ICEBERG_VERSION=0.11.1
>>
>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>
>> MINIOSERVER=192.168.160.5
>>
>>
>> # add AWS dependnecy
>> AWS_SDK_VERSION=2.15.40
>> AWS_MAVEN_GROUP=software.amazon.awssdk
>> AWS_PACKAGES=(
>>     "bundle"
>>     "url-connection-client"
>> )
>> for pkg in "${AWS_PACKAGES[@]}"; do
>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>> done
>>
>> # start Spark SQL client shell
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf
>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>> I got: MetaException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException: Class
>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
>> using s3 and should not cause this error. Any ideas? Thanks.
>>
>>
>> I got "ClassNotFoundException: Class
>> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
>> could I miss?
>>
>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
>> wrote:
>>
>>> Hey Lian,
>>>
>>> At a cursory glance, it appears that you might be mixing two different
>>> FileIO implementations, which may be why you are not getting the expected
>>> result.
>>>
>>> When you set: --conf
>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>>> actually switching over to the native S3 implementation within Iceberg (as
>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>>> settings to setup access are then set for the S3AFileSystem (which would
>>> not be used with S3FileIO).
>>>
>>> You might try just removing that line since it should use the
>>> HadoopFileIO at that point and may work.
>>>
>>> Hope that's helpful,
>>> -Dan
>>>
>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I try to create an iceberg table on minio s3 and hive.
>>>>
>>>> *This is how I launch spark-shell:*
>>>>
>>>> # add Iceberg dependency
>>>> export AWS_REGION=us-east-1
>>>> export AWS_ACCESS_KEY_ID=minio
>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>
>>>> ICEBERG_VERSION=0.11.1
>>>>
>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>
>>>> MINIOSERVER=192.168.160.5
>>>>
>>>>
>>>> # add AWS dependnecy
>>>> AWS_SDK_VERSION=2.15.40
>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>> AWS_PACKAGES=(
>>>>     "bundle"
>>>>     "url-connection-client"
>>>> )
>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>> done
>>>>
>>>> # start Spark SQL client shell
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf
>>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>>     --conf
>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>     --conf
>>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>> *Here is the spark code to create the iceberg table:*
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>> val values = List(1,2,3,4,5)
>>>>
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val core = "mytable8"
>>>> val table = s"hive_test.mydb.${core}"
>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>
>>>> df.writeTo(table)
>>>>     .tableProperty("write.format.default", "parquet")
>>>>     .tableProperty("location", s3IcePath)
>>>>     .createOrReplace()
>>>>
>>>> I got an error "The AWS Access Key Id you provided does not exist in
>>>> our records.".
>>>>
>>>> I have verified that I can login minio UI using the same username and
>>>> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>>> help me. Not sure why the credential does not work for iceberg + AWS. Any
>>>> idea or an example of writing an iceberg table to S3 using hive catalog
>>>> will be highly appreciated! Thanks.
>>>>
>>>>
>>>>
>>
>


Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Daniel Weeks <da...@gmail.com>.
So, if I recall correctly, the Hive server does need access to check and
create paths for table locations.

There may be an option to disable this behavior, but otherwise the filesystem
implementation probably needs to be available to the Hive metastore.
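
One way that could look in practice -- a sketch only, assuming a typical
standalone metastore layout where jars dropped into $HIVE_HOME/lib land on
the metastore classpath, with all paths and versions as placeholders to be
matched against your own Hadoop/Hive install:

# hedged sketch: make the S3A classes visible on the Hive metastore host
cp /path/to/hadoop-aws-3.2.0.jar \
   /path/to/aws-java-sdk-bundle-*.jar \
   $HIVE_HOME/lib/
# the aws-java-sdk-bundle version has to match the hadoop-aws release, and
# the same fs.s3a.* endpoint/credential settings passed to spark-shell also
# need to appear in the metastore's hive-site.xml (or core-site.xml) before
# it can resolve s3a:// table locations; restart the metastore afterwards

The point is simply that both the S3AFileSystem class and the MinIO
credentials have to be resolvable on the metastore side, not only in the
Spark driver.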

-Dan

On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <ji...@gmail.com> wrote:

> Thanks Daniel.
>
> After modifying the script to,
>
> export AWS_REGION=us-east-1
> export AWS_ACCESS_KEY_ID=minio
> export AWS_SECRET_ACCESS_KEY=minio123
>
> ICEBERG_VERSION=0.11.1
>
> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>
> MINIOSERVER=192.168.160.5
>
>
> # add AWS dependnecy
> AWS_SDK_VERSION=2.15.40
> AWS_MAVEN_GROUP=software.amazon.awssdk
> AWS_PACKAGES=(
>     "bundle"
>     "url-connection-client"
> )
> for pkg in "${AWS_PACKAGES[@]}"; do
>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
> done
>
> # start Spark SQL client shell
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf
> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive  \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
> I got: MetaException: java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not
> using s3 and should not cause this error. Any ideas? Thanks.
>
>
> I got "ClassNotFoundException: Class
> org.apache.hadoop.fs.s3a.S3AFileSystem not found". Any idea what dependency
> could I miss?
>
> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
> wrote:
>
>> Hey Lian,
>>
>> At a cursory glance, it appears that you might be mixing two different
>> FileIO implementations, which may be why you are not getting the expected
>> result.
>>
>> When you set: --conf
>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
>> actually switching over to the native S3 implementation within Iceberg (as
>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
>> settings to setup access are then set for the S3AFileSystem (which would
>> not be used with S3FileIO).
>>
>> You might try just removing that line since it should use the
>> HadoopFileIO at that point and may work.
>>
>> Hope that's helpful,
>> -Dan
>>
>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I try to create an iceberg table on minio s3 and hive.
>>>
>>> *This is how I launch spark-shell:*
>>>
>>> # add Iceberg dependency
>>> export AWS_REGION=us-east-1
>>> export AWS_ACCESS_KEY_ID=minio
>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>
>>> ICEBERG_VERSION=0.11.1
>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>
>>> MINIOSERVER=192.168.160.5
>>>
>>>
>>> # add AWS dependnecy
>>> AWS_SDK_VERSION=2.15.40
>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>> AWS_PACKAGES=(
>>>     "bundle"
>>>     "url-connection-client"
>>> )
>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>> done
>>>
>>> # start Spark SQL client shell
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf
>>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>>     --conf
>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf
>>> spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>> *Here is the spark code to create the iceberg table:*
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1,2,3,4,5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val core = "mytable8"
>>> val table = s"hive_test.mydb.${core}"
>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>
>>> df.writeTo(table)
>>>     .tableProperty("write.format.default", "parquet")
>>>     .tableProperty("location", s3IcePath)
>>>     .createOrReplace()
>>>
>>> I got an error "The AWS Access Key Id you provided does not exist in our
>>> records.".
>>>
>>> I have verified that I can login minio UI using the same username and
>>> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>>> AWS_SECRET_ACCESS_KEY env variables.
>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>> help me. Not sure why the credential does not work for iceberg + AWS. Any
>>> idea or an example of writing an iceberg table to S3 using hive catalog
>>> will be highly appreciated! Thanks.
>>>
>>>
>>>
>
>

Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Lian Jiang <ji...@gmail.com>.
Thanks Daniel.

After modifying the script to:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

MINIOSERVER=192.168.160.5


# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "bundle"
    "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive  \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

I got: MetaException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found. My Hive server is not
using S3, so it should not be causing this error. Any idea what dependency I
might be missing? Thanks.

On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <da...@gmail.com>
wrote:

> Hey Lian,
>
> At a cursory glance, it appears that you might be mixing two different
> FileIO implementations, which may be why you are not getting the expected
> result.
>
> When you set: --conf
> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO you're
> actually switching over to the native S3 implementation within Iceberg (as
> opposed to S3AFileSystem via HadoopFileIO).  However, all of the following
> settings to setup access are then set for the S3AFileSystem (which would
> not be used with S3FileIO).
>
> You might try just removing that line since it should use the HadoopFileIO
> at that point and may work.
>
> Hope that's helpful,
> -Dan
>
> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com> wrote:
>
>> Hi,
>>
>> I try to create an iceberg table on minio s3 and hive.
>>
>> *This is how I launch spark-shell:*
>>
>> # add Iceberg dependency
>> export AWS_REGION=us-east-1
>> export AWS_ACCESS_KEY_ID=minio
>> export AWS_SECRET_ACCESS_KEY=minio123
>>
>> ICEBERG_VERSION=0.11.1
>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>
>> MINIOSERVER=192.168.160.5
>>
>>
>> # add AWS dependnecy
>> AWS_SDK_VERSION=2.15.40
>> AWS_MAVEN_GROUP=software.amazon.awssdk
>> AWS_PACKAGES=(
>>     "bundle"
>>     "url-connection-client"
>> )
>> for pkg in "${AWS_PACKAGES[@]}"; do
>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>> done
>>
>> # start Spark SQL client shell
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf
>> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>     --conf spark.sql.catalog.hive_test.type=hive  \
>>     --conf
>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>> *Here is the spark code to create the iceberg table:*
>>
>> import org.apache.spark.sql.SparkSession
>> val values = List(1,2,3,4,5)
>>
>> val spark = SparkSession.builder().master("local").getOrCreate()
>> import spark.implicits._
>> val df = values.toDF()
>>
>> val core = "mytable8"
>> val table = s"hive_test.mydb.${core}"
>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>
>> df.writeTo(table)
>>     .tableProperty("write.format.default", "parquet")
>>     .tableProperty("location", s3IcePath)
>>     .createOrReplace()
>>
>> I got an error "The AWS Access Key Id you provided does not exist in our
>> records.".
>>
>> I have verified that I can login minio UI using the same username and
>> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
>> AWS_SECRET_ACCESS_KEY env variables.
>> https://github.com/apache/iceberg/issues/2168 is related but does not
>> help me. Not sure why the credential does not work for iceberg + AWS. Any
>> idea or an example of writing an iceberg table to S3 using hive catalog
>> will be highly appreciated! Thanks.
>>
>>
>>


Re: create iceberg on minio s3 got "The AWS Access Key Id you provided does not exist in our records."

Posted by Daniel Weeks <da...@gmail.com>.
Hey Lian,

At a cursory glance, it appears that you might be mixing two different
FileIO implementations, which may be why you are not getting the expected
result.

When you set --conf
spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
you're actually switching over to Iceberg's native S3 implementation (as
opposed to S3AFileSystem via HadoopFileIO). However, all of the following
settings for setting up access are then applied to the S3AFileSystem (which
would not be used with S3FileIO).

You might try just removing that line, since the catalog should fall back to
HadoopFileIO at that point and may work.
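
A minimal sketch of that launch, reusing the endpoint, credential, and
path-style settings already shown in this thread; the only changes are that
the io-impl line is dropped and hadoop-aws is added so that the
S3AFileSystem class can actually be loaded (3.2.0 is an assumption -- match
it to the Hadoop version your Spark build ships with):

/spark/bin/spark-shell --packages $DEPENDENCIES,org.apache.hadoop:hadoop-aws:3.2.0 \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem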

Hope that's helpful,
-Dan

On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <ji...@gmail.com> wrote:

> Hi,
>
> I try to create an iceberg table on minio s3 and hive.
>
> *This is how I launch spark-shell:*
>
> # add Iceberg dependency
> export AWS_REGION=us-east-1
> export AWS_ACCESS_KEY_ID=minio
> export AWS_SECRET_ACCESS_KEY=minio123
>
> ICEBERG_VERSION=0.11.1
> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>
> MINIOSERVER=192.168.160.5
>
>
> # add AWS dependnecy
> AWS_SDK_VERSION=2.15.40
> AWS_MAVEN_GROUP=software.amazon.awssdk
> AWS_PACKAGES=(
>     "bundle"
>     "url-connection-client"
> )
> for pkg in "${AWS_PACKAGES[@]}"; do
>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
> done
>
> # start Spark SQL client shell
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf
> spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>     --conf spark.sql.catalog.hive_test.type=hive  \
>     --conf
> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
> *Here is the spark code to create the iceberg table:*
>
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val core = "mytable8"
> val table = s"hive_test.mydb.${core}"
> val s3IcePath = s"s3a://spark-test/${core}.ice"
>
> df.writeTo(table)
>     .tableProperty("write.format.default", "parquet")
>     .tableProperty("location", s3IcePath)
>     .createOrReplace()
>
> I got an error "The AWS Access Key Id you provided does not exist in our
> records.".
>
> I have verified that I can login minio UI using the same username and
> password that I passed to spark-shell via AWS_ACCESS_KEY_ID and
> AWS_SECRET_ACCESS_KEY env variables.
> https://github.com/apache/iceberg/issues/2168 is related but does not
> help me. Not sure why the credential does not work for iceberg + AWS. Any
> idea or an example of writing an iceberg table to S3 using hive catalog
> will be highly appreciated! Thanks.
>
>
>