Posted to github@arrow.apache.org by "roeap (via GitHub)" <gi...@apache.org> on 2023/05/14 13:15:38 UTC

[GitHub] [arrow-datafusion] roeap opened a new issue, #6350: Advice on using external catalogue as Catalog/Schema provider

roeap opened a new issue, #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350

   ### Is your feature request related to a problem or challenge?
   
   When trying to integrate Unity Catalog with delta-rs, which in turn means integrating it with DataFusion, we are running into challenges making API calls to the catalog from some of the trait functions.
   
   Essentially, only the `table` function on `SchemaProvider` is async whereas all other functions on the various providers are synchronous.
   
   Looking at other implementations for remote catalogs, most seem to resort to downloading the data beforehand and then using the built-in `Memory*` implementations. While technically feasible, this poses other challenges, as the number of registered tables in a catalog may get quite large and may also change relatively frequently.
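
One way to soften the "changes relatively frequently" problem with a pre-loaded listing is to reload it once it gets stale. The following is a minimal std-only sketch under that assumption, not DataFusion or delta-rs code: `NameCache`, its `load` closure (a stand-in for the remote "list tables" call), and the `ttl` parameter are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache for a pre-loaded table listing that is re-fetched
/// once it is at least `ttl` old.
struct NameCache<F: FnMut() -> Vec<String>> {
    load: F, // stand-in for the remote "list tables" call
    ttl: Duration,
    state: Option<(Instant, Vec<String>)>,
}

impl<F: FnMut() -> Vec<String>> NameCache<F> {
    fn new(load: F, ttl: Duration) -> Self {
        Self { load, ttl, state: None }
    }

    /// Returns the cached names, reloading them first if missing or stale.
    /// `now` is passed in so the behaviour is easy to test.
    fn names(&mut self, now: Instant) -> &[String] {
        let stale = match &self.state {
            Some((loaded_at, _)) => now.duration_since(*loaded_at) >= self.ttl,
            None => true,
        };
        if stale {
            self.state = Some((now, (self.load)()));
        }
        &self.state.as_ref().expect("state was just populated").1
    }
}
```

The listing can still be out of date for up to `ttl`, so this only bounds staleness; it does not remove the cost of listing a large catalog.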
   
   Thus I was wondering if there are alternative best practices out there, or if it would be feasible to make some more of the trait functions async.
   
   ### Describe the solution you'd like
   
   Having more functions on the `*Provider` traits be async, with priority increasing as you move down the tree, i.e. pre-loading all catalogs hurts less than pre-loading all schemas, which hurts less than pre-loading all tables ...
   
   ### Describe alternatives you've considered
   
   * Preloading all data but the table details.
   * Working around using async in sync code while there is already a running runtime.
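
The second alternative above can be sketched with the standard library alone. This is a minimal illustration, not DataFusion or delta-rs code: `block_on`, `fetch_schema_names`, and `schema_names_blocking` are hypothetical names, and the toy executor only stands in for a real runtime. A client whose futures depend on a specific runtime (for example its timers or I/O driver) would need that runtime on the helper thread instead.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is parked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Toy std-only executor: polls `fut` on the current thread until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for an async REST call to a remote catalog.
async fn fetch_schema_names(catalog: String) -> Vec<String> {
    vec![format!("{catalog}.bronze"), format!("{catalog}.silver")]
}

/// Synchronous entry point, shaped like a sync provider method.
/// The future is driven on a helper thread, so the calling thread
/// (possibly an async worker) never tries to start a nested executor.
fn schema_names_blocking(catalog: &str) -> Vec<String> {
    let catalog = catalog.to_owned();
    thread::spawn(move || block_on(fetch_schema_names(catalog)))
        .join()
        .expect("catalog fetch thread panicked")
}
```

The calling thread still blocks until the helper thread finishes, so this trades the nested-runtime problem for a blocked thread.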
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] roeap commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "roeap (via GitHub)" <gi...@apache.org>.
roeap commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1547010192

   > but is a non-trivial change
   
   We can leave it as is then for now and I'll observe the actual impact in production as we adopt this more and more.
   
   I did have a somewhat related question though 😆. To implement the catalog client I more or less just copied the client and auth methods out of object_store, and was planning to use the same for the Glue catalog integration (this still uses the deprecated rusoto crate in delta-rs). It's quite nice to use for other multi-cloud scenarios.
   
   I guess my question is, do you believe there may be value in having something similar to object_store for cloud catalogs or maybe expose the client and credentials?
   
   I guess object_store would not want to make a contract with its consumers on the internal HTTP client APIs, but maybe it would work if it is made explicit that these APIs are not covered by the public contract and versioning guarantees?




[GitHub] [arrow-datafusion] roeap commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "roeap (via GitHub)" <gi...@apache.org>.
roeap commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1547000928

   Thinking about it a bit more, in interactive sessions users are likely only allowed to see a rather small subset of all catalogs / schemas / tables, so it's likely much less than I originally thought.
   
   Being able to asynchronously list all tables would still be nice to have, but probably not worth it if the change is too invasive here.




[GitHub] [arrow-datafusion] tustvold commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1549437555

   > I more or less just copied the client and auth methods out of object_store 
   
   I created https://github.com/apache/arrow-rs/issues/4223 to track adding support for exposing the authorization logic, as this is the major one, I think.




[GitHub] [arrow-datafusion] roeap commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "roeap (via GitHub)" <gi...@apache.org>.
roeap commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1546995574

   Yes, full table metadata / schema can be fetched in `table`. If making `table_names` async is feasible, that would be great!
   
   The thing with Unity Catalog is that, at least in Azure, there can be only one instance per tenant in a region. Since I work in a multi-national org, where colleagues across the globe are working in various workspaces within a region, the number of tables can be significant. Not to the point of raising concerns about memory or similar, but still too many to be comfortable :).
   
   At the catalog / schema level it's not so bad, and pre-loading is fine in my opinion.




[GitHub] [arrow-datafusion] tustvold commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1547007115

   I had a brief play, and it's certainly quite a fiddly change. The virality of async causes it to bubble up out of SessionContext, and this then causes some excitement with its locking. I don't think it is anything insurmountable, but it is a non-trivial change, which is likely why I punted on it initially 😅




[GitHub] [arrow-datafusion] roeap closed issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "roeap (via GitHub)" <gi...@apache.org>.
roeap closed issue #6350: Advice on using external catalogue as Catalog/Schema provider
URL: https://github.com/apache/arrow-datafusion/issues/6350




[GitHub] [arrow-datafusion] tustvold commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1546991540

   Would it work for `SchemaProvider::table` to fetch the table metadata such as schema information? This would mean you would not need to know table metadata beyond that referenced by a particular query?
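
The idea in this suggestion can be sketched roughly as follows. This is a simplified std-only model, not the DataFusion trait: `LazySchema`, `TableMeta`, and the `fetch` closure (a stand-in for the remote catalog call) are hypothetical, and the lookup is modelled synchronously here, whereas `SchemaProvider::table` is async.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical table metadata; a real provider would hold an Arrow schema.
#[derive(Clone, Debug, PartialEq)]
struct TableMeta {
    name: String,
    columns: Vec<String>,
}

/// Hypothetical schema-level provider: names are cheap and pre-loaded,
/// full metadata is fetched only for tables a query actually references.
struct LazySchema<F: Fn(&str) -> Option<TableMeta>> {
    table_names: Vec<String>,
    fetch: F, // stand-in for the remote catalog call
    cache: Mutex<HashMap<String, TableMeta>>,
}

impl<F: Fn(&str) -> Option<TableMeta>> LazySchema<F> {
    fn new(table_names: Vec<String>, fetch: F) -> Self {
        Self { table_names, fetch, cache: Mutex::new(HashMap::new()) }
    }

    /// Cheap: answered from the pre-loaded list, no remote call.
    fn table_names(&self) -> &[String] {
        &self.table_names
    }

    /// Fetches metadata on first use, then serves it from the cache.
    fn table(&self, name: &str) -> Option<TableMeta> {
        // The lock is released at the end of this statement, before fetching.
        let cached = self.cache.lock().unwrap().get(name).cloned();
        if let Some(meta) = cached {
            return Some(meta);
        }
        let meta = (self.fetch)(name)?;
        self.cache.lock().unwrap().insert(name.to_owned(), meta.clone());
        Some(meta)
    }
}
```

With this shape, only the tables referenced by a query ever trigger a metadata fetch; the listing itself still has to come from somewhere, which is the part the thread leaves open.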




[GitHub] [arrow-datafusion] roeap commented on issue #6350: Advice on using external catalogue as Catalog/Schema provider

Posted by "roeap (via GitHub)" <gi...@apache.org>.
roeap commented on issue #6350:
URL: https://github.com/apache/arrow-datafusion/issues/6350#issuecomment-1550326033

   Thanks @tustvold for taking care of this! I'll close this issue. Thanks again for the great advice.

