Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/09 22:49:45 UTC

[GitHub] [iceberg] johnclara commented on pull request #1844: AWS: support custom client configuration

johnclara commented on pull request #1844:
URL: https://github.com/apache/iceberg/pull/1844#issuecomment-742112667


   > though I'd like to hear thoughts from others on that, @johnclara?
   
   I'm still catching up on all the different configuration locations/barriers; as my comments above make clear, I'm confused about how it all fits together.
   
   Ryan's email helped me a lot: https://lists.apache.org/thread.html/r1c357c7867234fed58b12f0cba6e9d2c13ed1af94b8d922ffe3a3def%40%3Cdev.iceberg.apache.org%3E
   
   For this PR, it sounds like things are still in flux, but my current understanding is (please correct whatever I have wrong):
   
   Taking Spark as an example, the config barriers are:
   User passes:
   command line args + Hadoop config files on disk
   
   Spark will push down:
   DataSourceV2 options + Hadoop config
   
   The Spark source will initialize the catalog with:
   a Map<String, String> argument (called properties?) + the Hadoop config (a sketch of this hand-off follows below)
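   
   To make that last hand-off concrete, here's a minimal sketch of how I picture it, assuming CatalogUtil.loadCatalog is the reflective entry point; the property keys, values, and the HiveCatalog impl are just illustrative:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.CatalogUtil;
   import org.apache.iceberg.catalog.Catalog;
   
   public class CatalogHandOff {
     public static void main(String[] args) {
       // The flat Map<String, String> that Spark assembles from DataSourceV2
       // options (keys and values here are illustrative).
       Map<String, String> properties = new HashMap<>();
       properties.put("warehouse", "s3://my-bucket/warehouse");
       properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
   
       // The Hadoop config pushed down alongside it (loads the *-site.xml
       // files found on the classpath).
       Configuration hadoopConf = new Configuration();
   
       // As I understand it, loadCatalog instantiates the impl reflectively,
       // hands it the Hadoop config if it implements Configurable, and then
       // calls initialize(name, properties).
       Catalog catalog = CatalogUtil.loadCatalog(
           "org.apache.iceberg.hive.HiveCatalog", "my_catalog", properties, hadoopConf);
     }
   }
   ```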
   
   Inside the catalog, the config sources seem to be:
   Map<String, String> arguments: system/compute constraints (like running on prem vs. on an EC2 instance)
   HadoopConfig: for setting up HadoopFileSystem
   CatalogProperties: I'm not sure what these are. If the catalog is Hive or Glue, is it a Map<String, String> per table entry, or for the catalog as a whole? In Ryan's email I think it says the catalog should just give the location and the fact that it is an Iceberg table (but what if it should also say whether the Iceberg table is backed by AWS vs. GCP and how to read it?)
   TableProperties: table-specific information (a sketch of the catalog-level vs. table-level split follows below)
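   
   For what it's worth, here's how I'd sketch that catalog-level vs. table-level split, assuming the CatalogProperties and TableProperties constant classes are the right ones to look at (the table name is made up):
   
   ```java
   import org.apache.iceberg.CatalogProperties;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.TableProperties;
   import org.apache.iceberg.catalog.Catalog;
   import org.apache.iceberg.catalog.TableIdentifier;
   
   public class PropertyScopes {
     static void show(Catalog catalog) {
       // Catalog-level: keys like "warehouse" arrive once, in the
       // Map<String, String> passed to Catalog.initialize(name, properties).
       String warehouseKey = CatalogProperties.WAREHOUSE_LOCATION; // "warehouse"
   
       // Table-level: stored in the table's own metadata JSON and read per table.
       Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
       String format = table.properties().getOrDefault(
           TableProperties.DEFAULT_FILE_FORMAT,
           TableProperties.DEFAULT_FILE_FORMAT_DEFAULT);
     }
   }
   ```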
   
   It sounds like for tables that use S3FileIO:
   HadoopConfig should be ignored, since we're not using HadoopFileIO
   The Map<String, String> arguments should give enough info to reach the catalog
   Then the catalog properties should be mixed with the arguments that constrain where the compute is running (e.g. a proxy for on prem vs. in the cloud). That should give enough info to read the table's metadata JSON file
   Then the catalog properties should be ignored, and the table properties should be mixed with the arguments. That should give enough information to read and write the table? (sketch below)
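   
   A sketch of what I mean by "HadoopConfig should be ignored": S3FileIO would be configured purely from the flat property map. I'm assuming it picks its settings up through FileIO.initialize; the paths and property keys below are made up:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.iceberg.aws.s3.S3FileIO;
   import org.apache.iceberg.io.FileIO;
   import org.apache.iceberg.io.InputFile;
   
   public class S3FileIOSketch {
     public static void main(String[] args) {
       // No Hadoop Configuration anywhere: everything comes from the flat map.
       Map<String, String> properties = new HashMap<>();
       properties.put("s3.sse.type", "kms");          // illustrative keys
       properties.put("s3.sse.key", "my-kms-key-id");
   
       FileIO io = new S3FileIO();
       io.initialize(properties);
   
       // Enough to read the table's metadata JSON directly from S3.
       InputFile metadataFile = io.newInputFile(
           "s3://my-bucket/warehouse/db/events/metadata/v1.metadata.json");
     }
   }
   ```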
   
   Should a FileIO be created just for the metadata JSON file, and then a new FileIO created for the rest of the table (manifest list/manifests + data files)?
   
   Where should the KMS key ID go? Is it a per-table setting for writing data files (table properties), or a user/system-defined property that should go in the arguments? (both placements sketched below)
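   
   To spell out the two placements I'm asking about (the "s3.sse.key" name is just a guess):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.iceberg.Table;
   
   public class KmsKeyPlacement {
     static void perTable(Table table) {
       // Option 1: per-table, committed into the table's own metadata.
       table.updateProperties()
           .set("s3.sse.key", "my-kms-key-id") // hypothetical key name
           .commit();
     }
   
     static Map<String, String> systemWide() {
       // Option 2: user/system-defined, passed once in the Map<String, String>
       // arguments when the catalog (and its FileIO) is initialized.
       Map<String, String> properties = new HashMap<>();
       properties.put("s3.sse.key", "my-kms-key-id");
       return properties;
     }
   }
   ```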
   
   It would help me a ton if this were written out somewhere (it might be in the docs Jack has written up; I still have to go take a look).

