Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/12 11:27:40 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue, #2206: [datafusion-contrib] AWS Glue Integration

tustvold opened a new issue, #2206:
URL: https://github.com/apache/arrow-datafusion/issues/2206

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   This has been discussed in several places, including https://github.com/apache/arrow-datafusion/issues/907 and https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53, so I am creating an issue for visibility.
   
   **Describe the solution you'd like**
   
   I would propose creating a new datafusion-contrib crate, perhaps `datafusion-catalog-glue`, which communicates with an [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html).
   
   I'll leave the exact design for whoever picks this up, but I might expect something along the following lines.
   
   * Create a `GlueCatalog` with an optional catalog ID
   * Provide an `async fn GlueCatalog::list_databases(&self) -> Vec<String>` to list the databases
   * Provide an `async fn GlueCatalog::get_database(&self, name: &str) -> Result<GlueDatabase>` to get a database
   * Implement `SchemaProvider` for `GlueDatabase`
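
   The API shape in the list above could be sketched roughly as follows. This is only an illustration with in-memory stand-ins: `GlueTable`, `GlueError`, `with_table`, and the `BTreeMap` storage are hypothetical, the methods are synchronous so the sketch runs without an async runtime (the proposal has them `async`), and neither a Glue client nor the DataFusion `SchemaProvider` trait is actually used.

   ```rust
   use std::collections::BTreeMap;

   /// Hypothetical error type for this sketch.
   #[derive(Debug, PartialEq)]
   pub enum GlueError {
       DatabaseNotFound(String),
   }

   /// Stand-in for one table's metadata: its name and storage location.
   #[derive(Debug, Clone, PartialEq)]
   pub struct GlueTable {
       pub name: String,
       pub location: String,
   }

   /// Handle to one database; the issue proposes implementing
   /// `SchemaProvider` for this type.
   #[derive(Debug, Clone, PartialEq)]
   pub struct GlueDatabase {
       pub name: String,
       tables: BTreeMap<String, GlueTable>,
   }

   impl GlueDatabase {
       /// The kind of lookups a `SchemaProvider` implementation would answer.
       pub fn table_names(&self) -> Vec<String> {
           self.tables.keys().cloned().collect()
       }

       pub fn table(&self, name: &str) -> Option<&GlueTable> {
           self.tables.get(name)
       }
   }

   /// Catalog handle with an optional catalog ID.
   pub struct GlueCatalog {
       catalog_id: Option<String>,
       // In-memory stand-in for the remote Glue service (assumption).
       databases: BTreeMap<String, GlueDatabase>,
   }

   impl GlueCatalog {
       pub fn new(catalog_id: Option<String>) -> Self {
           GlueCatalog { catalog_id, databases: BTreeMap::new() }
       }

       pub fn catalog_id(&self) -> Option<&str> {
           self.catalog_id.as_deref()
       }

       /// Helper for seeding the stand-in; not part of the proposal.
       pub fn with_table(mut self, database: &str, table: &str, location: &str) -> Self {
           let db = self
               .databases
               .entry(database.to_string())
               .or_insert_with(|| GlueDatabase {
                   name: database.to_string(),
                   tables: BTreeMap::new(),
               });
           db.tables.insert(
               table.to_string(),
               GlueTable { name: table.to_string(), location: location.to_string() },
           );
           self
       }

       /// Synchronous stand-in for the proposed `async fn list_databases`.
       pub fn list_databases(&self) -> Vec<String> {
           self.databases.keys().cloned().collect()
       }

       /// Synchronous stand-in for the proposed `async fn get_database`.
       pub fn get_database(&self, name: &str) -> Result<GlueDatabase, GlueError> {
           self.databases
               .get(name)
               .cloned()
               .ok_or_else(|| GlueError::DatabaseNotFound(name.to_string()))
       }
   }
   ```

   Returning an owned `GlueDatabase` from `get_database` keeps the handle independent of the catalog's lifetime, which would presumably make it easier to register with a query engine; whether the real crate should do this is left open.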
   
   I think it should be possible to reuse the `FileScanConfig` structure used by `ListingTable` to simplify implementation of the `TableProvider`.
   
   **Describe alternatives you've considered**
   
   We could choose not to support AWS Glue.
   
   **Additional context**
   
   This will help with https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53 by removing the need to infer the schema from the files on every query, and by listing only the files in non-pruned partitions.
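
   To illustrate the partition point, here is a minimal sketch of partition pruning, assuming the catalog returns each partition's column values and storage location. The `Partition` and `PartitionFilter` types and both functions are hypothetical and handle only equality filters; a real implementation would presumably work from DataFusion filter expressions.

   ```rust
   /// Hypothetical stand-in for a catalog partition: its partition-column
   /// values and the storage prefix that holds its files.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Partition {
       pub values: Vec<(String, String)>, // (column, value) pairs
       pub location: String,
   }

   /// Hypothetical equality filter on a partition column, e.g. `year = '2022'`.
   #[derive(Debug, Clone, PartialEq)]
   pub struct PartitionFilter {
       pub column: String,
       pub value: String,
   }

   /// Keep only the partitions that satisfy every filter. A partition that
   /// lacks a filtered column is treated as not matching.
   pub fn prune_partitions<'a>(
       partitions: &'a [Partition],
       filters: &[PartitionFilter],
   ) -> Vec<&'a Partition> {
       partitions
           .iter()
           .filter(|p| {
               filters.iter().all(|f| {
                   p.values
                       .iter()
                       .any(|(column, value)| column == &f.column && value == &f.value)
               })
           })
           .collect()
   }

   /// Storage prefixes that still need to be listed after pruning.
   pub fn locations_to_list(partitions: &[Partition], filters: &[PartitionFilter]) -> Vec<String> {
       prune_partitions(partitions, filters)
           .into_iter()
           .map(|p| p.location.clone())
           .collect()
   }
   ```

   With partitions for `year=2021` and `year=2022` and a filter on `year = '2022'`, only the 2022 prefixes would be passed on for listing in S3.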
   
   This may need to depend on https://github.com/datafusion-contrib/datafusion-objectstore-s3 as I think it will still need to list S3 in order to get the files within a given partition.
   
   The Glue API is not the snappiest of things, so a future extension might be to cache the metadata returned, as is done by the [Java client](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore#enabling-client-side-caching-for-catalog).
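
   A minimal sketch of what such client-side caching might look like, assuming a simple time-to-live policy. `TtlCache` and `get_or_fetch` are hypothetical names, and the closure stands in for a call to the Glue API; the caching behaviour of the Java client may well differ.

   ```rust
   use std::collections::HashMap;
   use std::time::{Duration, Instant};

   /// Hypothetical time-to-live cache keyed by a string such as a
   /// database or table name.
   pub struct TtlCache<V> {
       ttl: Duration,
       entries: HashMap<String, (Instant, V)>,
   }

   impl<V: Clone> TtlCache<V> {
       pub fn new(ttl: Duration) -> Self {
           TtlCache { ttl, entries: HashMap::new() }
       }

       /// Return the cached value for `key` if it is younger than the TTL;
       /// otherwise call `fetch`, store its result, and return it.
       pub fn get_or_fetch<F>(&mut self, key: &str, fetch: F) -> V
       where
           F: FnOnce() -> V,
       {
           let now = Instant::now();
           if let Some((stored_at, value)) = self.entries.get(key) {
               if now.duration_since(*stored_at) < self.ttl {
                   return value.clone();
               }
           }
           // Missing or expired: go to the (slow) source and refresh the entry.
           let value = fetch();
           self.entries.insert(key.to_string(), (now, value.clone()));
           value
       }
   }
   ```

   A TTL keeps the sketch simple but means schema changes made in Glue are only seen after the entry expires, so the duration would need to be configurable.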


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] matthewmturner commented on issue #2206: [datafusion-contrib] AWS Glue Integration

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #2206:
URL: https://github.com/apache/arrow-datafusion/issues/2206#issuecomment-1096735274

   I created an issue on the official aws-sdk-rust repo to add support for Glue. I believe they prioritize based on 👍 reactions, so if you are interested in this, please upvote there when you get the chance.
   
   I suppose this would mean using rusoto in the meantime. I'm unsure what the implications would be of using rusoto for Glue and aws-sdk-rust for S3 (via `datafusion-objectstore-s3`). If there is any incompatibility, perhaps we could add a rusoto feature to `datafusion-objectstore-s3` that could be used until official support lands.


[GitHub] [arrow-datafusion] timvw commented on issue #2206: [datafusion-contrib] AWS Glue Integration

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2206:
URL: https://github.com/apache/arrow-datafusion/issues/2206#issuecomment-1116651823

   I have some sample code that uses the aws-sdk-glue client, iterates over all databases and tables, and registers them in a `MemoryCatalogProvider`:
   
   https://gist.github.com/timvw/84246389d9c79fc0bf07570c625fdaf4

