Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/01/30 02:36:02 UTC

[GitHub] [incubator-iceberg] ramkumarkb opened a new issue #761: Suggestion for newbie getting started guide

ramkumarkb opened a new issue #761: Suggestion for newbie getting started guide
URL: https://github.com/apache/incubator-iceberg/issues/761

Hello,

I am a newbie with Iceberg and was going through the documentation and table format specs. I was wondering if there is a simpler way to get started without Spark or Presto. Specifically, can one do the following:

1. Install a local S3 object store with MinIO - https://min.io/ - rather straightforward
2. Write a Java app to create tables. Is https://iceberg.apache.org/api-quickstart/#using-hadoop-tables the way to do it, and can it be used for S3 with `s3a://` instead of `hdfs://`, so that storage points at the S3 installation above? Is it as simple as that? I am a bit wary of which versions of the Hadoop libraries to use (and their compatibility with S3).
3. Similarly, create a Schema and Partition Spec - https://iceberg.apache.org/api-quickstart/#create-a-schema
4. Write a client Java app to add some data, read data, and modify one particular record. Not sure if this is via `IcebergGenerics`?

I think I can contribute this, with some guidance of course, if this is of any use.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] apatrida commented on issue #761: Suggestion for newbie getting started guide

Posted by GitBox <gi...@apache.org>.

apatrida commented on issue #761:
URL: https://github.com/apache/iceberg/issues/761#issuecomment-1050808548


   Looks like no atomic rename:
   
   https://github.com/minio/minio-js/issues/787
   https://docs.min.io/docs/disaggregated-spark-and-hadoop-hive-with-minio.html



[GitHub] [incubator-iceberg] rdblue commented on issue #761: Suggestion for newbie getting started guide

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #761: Suggestion for newbie getting started guide
URL: https://github.com/apache/incubator-iceberg/issues/761#issuecomment-582979633
 
 
   @ramkumarkb, it isn't clear from that alone. Iceberg's Hadoop table implementation requires atomic rename for correctness.


[GitHub] [incubator-iceberg] ramkumarkb commented on issue #761: Suggestion for newbie getting started guide

Posted by GitBox <gi...@apache.org>.

ramkumarkb commented on issue #761: Suggestion for newbie getting started guide
URL: https://github.com/apache/incubator-iceberg/issues/761#issuecomment-582860009
 
 
   Ryan,
   
   Thank you for your reply.
   
   As per the MinIO docs - https://docs.minio.io/docs/distributed-minio-quickstart-guide.html 
   > MinIO follows strict read-after-write and list-after-write consistency model for all i/o operations both in distributed and standalone modes.
   
   With this, is a Hive Metastore still needed?
   


[GitHub] [incubator-iceberg] rdblue commented on issue #761: Suggestion for newbie getting started guide

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #761: Suggestion for newbie getting started guide
URL: https://github.com/apache/incubator-iceberg/issues/761#issuecomment-581184156
 
 
   Hadoop tables cannot be used with a file system that doesn't support atomic rename. They should only be used with HDFS or a local FS, not with S3. For S3, you should be using a metastore, like Hive.
   
   You're right that IcebergGenerics is the recommended way to read a table directly, without using an engine like Spark or Presto.
   
   If you want to write up some documentation, that would be great! Please open a PR.
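
To illustrate why a file-system-based table commit depends on rename semantics, here is a minimal Java sketch. It is not Iceberg's implementation: the class `RenameCommitSketch`, the method `tryCommit`, and the `v<N>.metadata.json` naming are assumptions made for illustration, and a local temp directory stands in for HDFS. Each writer stages its metadata in a temporary file and then tries to rename it to the next version file; `Files.move` without `REPLACE_EXISTING` fails if the target already exists, so only one writer can claim a given version. The demo is single-threaded and only shows the fail-if-exists behavior, not concurrency.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameCommitSketch {

    // Hypothetical helper: stage metadata in a temp file, then try to claim
    // the version file by renaming. Returns true if this writer won.
    static boolean tryCommit(Path tableDir, int version, String metadataJson) throws IOException {
        Path tmp = Files.createTempFile(tableDir, "pending-", ".metadata.json");
        Files.write(tmp, metadataJson.getBytes(StandardCharsets.UTF_8));

        // Assumed naming scheme for this sketch only.
        Path target = tableDir.resolve("v" + version + ".metadata.json");
        try {
            // No REPLACE_EXISTING: the move fails if another writer already
            // created the target, so an existing version is never overwritten.
            Files.move(tmp, target);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Lost the race: clean up and let the caller retry on a new version.
            Files.deleteIfExists(tmp);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tableDir = Files.createTempDirectory("table-metadata");
        // Two writers both try to commit version 1; only the first succeeds.
        System.out.println("writer A committed: " + tryCommit(tableDir, 1, "{\"writer\":\"A\"}"));
        System.out.println("writer B committed: " + tryCommit(tableDir, 1, "{\"writer\":\"B\"}"));
    }
}
```

On a store where rename is emulated with copy and delete, as the MinIO documentation quoted in this thread describes for the default S3A committer, that single check-and-claim step is not available, so two writers could each believe their commit succeeded. That is the correctness concern behind the advice to use a metastore for S3.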


[GitHub] [iceberg] apatrida edited a comment on issue #761: Suggestion for newbie getting started guide

Posted by GitBox <gi...@apache.org>.

apatrida edited a comment on issue #761:
URL: https://github.com/apache/iceberg/issues/761#issuecomment-1050808548

Looks like no atomic rename:

https://github.com/minio/minio-js/issues/787
https://docs.min.io/docs/disaggregated-spark-and-hadoop-hive-with-minio.html

In the 2nd link:

> S3A is the connector to use S3 and other S3-compatible object stores such as MinIO. MapReduce workloads typically interact with object stores in the same way they do with HDFS. These workloads rely on HDFS atomic rename functionality to complete writing data to the datastore. Object storage operations are atomic by nature and they do not require/implement rename API. The default S3A committer emulates renames through copy and delete APIs. This interaction pattern causes significant loss of performance because of the write amplification. Netflix, for example, developed two new staging committers - the Directory staging committer and the Partitioned staging committer - to take full advantage of native object storage operations. These committers do not require rename operation. The two staging committers were evaluated, along with another new addition called the Magic committer for benchmarking.
