Posted to github@beam.apache.org by "psolomin (via GitHub)" <gi...@apache.org> on 2023/04/26 11:13:30 UTC

[GitHub] [beam] psolomin opened a new issue, #26429: [Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data

psolomin opened a new issue, #26429:
URL: https://github.com/apache/beam/issues/26429

   ### What would you like to happen?
   
   ## Overview
   
   The Glue catalog is a serverless metadata store (databases, tables, schemas, partitions, connectors, etc.). More: https://docs.aws.amazon.com/glue/latest/dg/glue-connections.html
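   
   For context, a catalog table entry carries everything an IO connector would need: the data location, the column schema and the partition keys. Below is a minimal sketch of fetching that metadata with the AWS SDK for Java v2 (the database and table names are placeholders):
   
   ```
   import software.amazon.awssdk.services.glue.GlueClient;
   import software.amazon.awssdk.services.glue.model.GetTableRequest;
   import software.amazon.awssdk.services.glue.model.Table;
   
   public class GlueCatalogLookup {
       public static void main(String[] args) {
           try (GlueClient glue = GlueClient.create()) {
               // Fetch the table definition from the Glue catalog
               Table table = glue.getTable(GetTableRequest.builder()
                       .databaseName("my_glue_db")   // placeholder names
                       .name("my_glue_table")
                       .build())
                   .table();
   
               // The pieces an IO connector would care about:
               System.out.println("Location: " + table.storageDescriptor().location()); // e.g. s3://bucket/prefix/
               table.storageDescriptor().columns()
                   .forEach(c -> System.out.println("Column: " + c.name() + " " + c.type()));
               table.partitionKeys()
                   .forEach(k -> System.out.println("Partition key: " + k.name()));
           }
       }
   }
   ```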
   
   It would be cool to have Beam support something like:
   
   ```
   GlueIO.write()
         .withClientConfiguration(awsClientConfiguration)
         .withDatabaseName("my_glue_db")
         .withTableName("my_glue_table")
         .withIOType(org.apache.beam.sdk.io.FileIO.class)
         .withIOConfig(... configs for FileIO / JdbcIO / ... )
         .withSchemaUpdateStrategy(ADD_NEW_COLUMNS | DISABLED)
   ```
   
   ## Other existing implementations
   
   1. For AWS S3-backed tables, Spark on AWS EMR supports writing to a Glue table in the same way it does for Hive Metastore tables:
   
   ```
   df.write.saveAsTable("glue_db.glue_table")
   ```
   
   2. Trino supports the Glue catalog too - https://trino.io/docs/current/connector/hive.html - in a similar fashion to Spark: it uses the catalog as a replacement for the Hive Metastore for tables whose data is stored in some filesystem.
   
   3. AWS Glue jobs (AWS's proprietary fork of Spark) support other types of storage as well: MongoDB, RDS, etc.
   
   4. Flink seems to have this as a work in progress: https://github.com/apache/flink-connector-aws/pull/47
   
   ## Notes on possible implementation
   
   Beam has an `HCatalogIO` implementation - https://beam.apache.org/documentation/io/built-in/hcatalog/ - but it does not seem to be a good place for a `GlueIO`:
 - it is tightly coupled with Hive dependencies
 - it cannot run on Java 11: https://github.com/apache/beam/issues/21299
 - a `GlueIO` could potentially support more storage types besides file systems, which would be easier to add in a standalone IO
 - `HCatalogIO` currently doesn't have any machinery for AWS auth, coders, etc.
   
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner




[GitHub] [beam] psolomin commented on issue #26429: [Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data

Posted by "psolomin (via GitHub)" <gi...@apache.org>.
psolomin commented on issue #26429:
URL: https://github.com/apache/beam/issues/26429#issuecomment-1544796438

   @mosche 
   
   > Do you have specific use cases in mind?
   
   Yeah, let me name some:
   - read from a Glue table backed by files in AWS S3 - FileIO would actually read the files the Glue catalog points to (see the sketch after this list)
   - read from a Glue table backed by an AWS RDS (SQL) table - JdbcIO would actually read the table the Glue catalog references
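   
   For the S3-backed case, here is a rough sketch of what such a connector could do under the hood today: resolve the table location at pipeline-construction time, then delegate to an existing file-based IO. The names and the plain-text data assumption are placeholders, and the S3 filesystem module (beam-sdks-java-io-amazon-web-services2) is assumed to be on the classpath:
   
   ```
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.TextIO;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import software.amazon.awssdk.services.glue.GlueClient;
   import software.amazon.awssdk.services.glue.model.GetTableRequest;
   
   public class ReadGlueS3Table {
       public static void main(String[] args) {
           String location;
           try (GlueClient glue = GlueClient.create()) {
               // Resolve where the table's data lives from the Glue catalog
               location = glue.getTable(GetTableRequest.builder()
                       .databaseName("my_glue_db")   // placeholder names
                       .name("my_glue_table")
                       .build())
                   .table().storageDescriptor().location();   // e.g. s3://my-bucket/my_glue_table/
           }
   
           Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
           // Delegate the actual reading to a file-based IO (assumes newline-delimited text
           // and that the location ends with "/" with files sitting directly under it)
           p.apply(TextIO.read().from(location + "*"));
           p.run().waitUntilFinish();
       }
   }
   ```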
   
   Writing is trickier because in that case Beam will also need to update Glue catalog objects, as sketched below.
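   
   For example, after new files have been written for a partition, the connector would also have to register that partition in the catalog. A hedged sketch with the AWS SDK for Java v2 (names, partition value and path are placeholders; a real implementation would also copy serde and column info from the table's storage descriptor):
   
   ```
   import software.amazon.awssdk.services.glue.GlueClient;
   import software.amazon.awssdk.services.glue.model.BatchCreatePartitionRequest;
   import software.amazon.awssdk.services.glue.model.PartitionInput;
   import software.amazon.awssdk.services.glue.model.StorageDescriptor;
   
   public class RegisterPartition {
       public static void main(String[] args) {
           try (GlueClient glue = GlueClient.create()) {
               // Describe the partition that was just written (placeholder values)
               PartitionInput partition = PartitionInput.builder()
                   .values("2023-04-26")   // partition value, e.g. a date
                   .storageDescriptor(StorageDescriptor.builder()
                       .location("s3://my-bucket/my_glue_table/dt=2023-04-26/")
                       .build())
                   .build();
   
               // Make the new data visible to engines that query via the Glue catalog
               glue.batchCreatePartition(BatchCreatePartitionRequest.builder()
                   .databaseName("my_glue_db")
                   .tableName("my_glue_table")
                   .partitionInputList(partition)
                   .build());
           }
       }
   }
   ```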
   
   > Glue preparing that data itself
   
   "Glue" is actually multiple things: catalog (similar to Hive Metastore), Glue jobs (AWS proprietary version of serverless Spark), Glue crawlers, etc. For now this feature request is about Glue catalog only.
   
   > Iceberg
   
   That one would be very useful, yes. Trino, Flink and Spark already have Iceberg support, and adding it to Beam would bring even more value, I would say. And, as I recall, Iceberg can work without a catalog or Hive Metastore, using table locations where both data & metadata are stored.
   




[GitHub] [beam] mosche commented on issue #26429: [Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on issue #26429:
URL: https://github.com/apache/beam/issues/26429#issuecomment-1535816174

   @psolomin Do you have specific use cases in mind? 
   Reading tables from the Glue meta store sounds like a useful integration! I'm not too sure about writing, though; it feels a bit like that would conflict with the purpose of Glue preparing that data itself ... not sure though, I haven't used Glue much.
   
   On the practical side, I'd expect more and more S3-backed tables catalogued in Glue to migrate to Iceberg / Hudi rather than keep using the old Hive format. Beam not having support for these newer table formats might limit the value of such an IO.
   I've been thinking about working on an IcebergIO for Beam, but unfortunately won't have time for it in the near future.




[GitHub] [beam] mosche commented on issue #26429: [Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on issue #26429:
URL: https://github.com/apache/beam/issues/26429#issuecomment-1640334311

   @psolomin fyi https://github.com/apache/beam/issues/20327

