You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/25 09:30:37 UTC

[GitHub] [arrow-datafusion] yahoNanJing opened a new pull request, #3955: Change back to non async interface for try_into_logical_plan

yahoNanJing opened a new pull request, #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3954.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1292447599

   I'm not intimately familiar with the details here, but when given a choice between async or not, I almost always favour the latter.  :sweat_smile:. Certainly the idea of deserialization logic making network calls a little bit surprising. If it is possible to achieve the end result without making this async, that seems like a better path to me
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290804432

   > update the code in question to be async?
   
   FWIW, I have already done this, I just haven't filed a PR yet. I can do that easily if we decide to stick with async.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290797135

   > with spawn blocking.
   
   This is [the PR](https://github.com/delta-io/delta-rs/pull/892) in question. This is the [deserialization line of code](https://github.com/spaceandtimelabs/delta-rs/blob/9c6918f73f8f67e03a8f7db6ae002e786c867c0c/rust/src/delta_datafusion.rs#L832) that calls the [async method](https://github.com/spaceandtimelabs/delta-rs/blob/9c6918f73f8f67e03a8f7db6ae002e786c867c0c/rust/src/delta_datafusion.rs#L858). If you know another way to do this, I would be open to specific recommendations. I don't understand how spawn blocking would help, but I am open to it.
   
   The only other option I was able to think of is to serialize the whole state of the `DeltaTable` including the `ObjectStore` config and the JSON history of the delta log.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290774729

   Thanks @avantgardnerio for the explanation. In case of network usage or other IO usage, we can refer to the object store interface implementation with spawn blocking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1292876221

   Thanks @avantgardnerio. I agree to close this PR if #3978 works for both you and non-async 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290818498

   @yahoNanJing I can't tell from this PR or https://github.com/apache/arrow-datafusion/issues/3954 about why you are concerned about this change. Specifically does `async` prevent you from doing something? Or did you just want clarification on what the API was changed? Or something else?
   
   
   Can you please let us know what you would prefer 
   1. backout the `async` change to the serialize API while we have a discussion on the matter
   2. keep it in and proceed
   3. Something else
   
   Note that @andygrove filed  https://github.com/apache/arrow-datafusion/issues/3957 recently 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#discussion_r1006053534


##########
datafusion/core/src/datasource/datasource.rs:
##########
@@ -86,5 +86,5 @@ pub trait TableProvider: Sync + Send {
 #[async_trait]

Review Comment:
   Can probably remove `#[async_trait]` as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290285043

   It's caused by #3907. I think there's some discussion before for changing the async interface to non async. I'm wondering why it's changed back to async again in #3907. 
   
   Hi @alamb, @andygrove and @avantgardnerio, what do you think of this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #3955: Change back to non async interface for try_into_logical_plan
URL: https://github.com/apache/arrow-datafusion/pull/3955


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1291877333

   Hi @avantgardnerio and @alamb, I think it's a similar case for the `execute` interface in `ExecutionPlan`. There're some discussion there, #2307, #2434. 
   
   > Specifically does async prevent you from doing something?
   Currently there's no blocking issue for the async change to ballista. But I'm not sure whether it will bring issue in the future. From my point of view, generally, the interface of creating the logical plan should be serialize API. For some edge case, we can use some workaround like https://docs.rs/tokio/1.21.2/tokio/runtime/struct.Runtime.html#method.block_on does.
   
   Maybe we can involve more guys for the API design, like @tustvold, @andygrove, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1292637737

   @yahoNanJing and everyone, I'd like to close this PR in favor of https://github.com/apache/arrow-datafusion/pull/3978 which will do the same thing, but not break deltalake. As long as we get it merged soon (before 14?) I think no harm was done, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1292670860

   So basically making an API `async` I think allows for more network enabled planning (e.g async table providers etc when the catalog information is not already at hand). I don't have a strong opinion either way. I agree with @yahoNanJing  there are ways to do serialization without requiring network access (by leaving placeholders, for example, and then do a subsequent pass to look up anything from the network that is needed)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1293546194

   https://github.com/apache/arrow-datafusion/pull/3978 was merged so closing this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290713704

   > there's some discussion before
   
   @yahoNanJing I am not familiar with the discussion (or don't recall)... do you have a link to the PR or github issue?
   
   > wondering why it's changed back to async
   
   Because [creating TableProviders](https://github.com/apache/arrow-datafusion/blob/7559c4425e6f32655c6d09e8ed17c9c51896472b/datafusion/core/src/execution/context.rs#L432) may have to be an async operation for ones like Deltalake that need to go load schema from the network.
   
   I looked into the alternative: having two methods on `TableProviderFactory`:
   
   1. `fn async create(url: String)` - what we have now
   2. `fn with_schema(url: String, schema: SchemaRef)` so that in theory when deserializing `TableProviders` we could skip the async operation by using the schema which should have been serialized in the scan.
   
   Unfortunately, it did not look trivial to serialize all the state that a `DeltaTable` sets up during planning, so based on @andygrove 's suggestion I switched it to async.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1290801916

   > Thanks @avantgardnerio for the explanation. In case of network usage or other IO usage, we can refer to the object store interface implementation with spawn blocking.
   
   Yeah, the `async` is very infectious -- so adding async definitely causes non trivial downstream churn. 
   
   @yahoNanJing  would an alternative approach be to create a PR in Ballista to update the code in question to be async?
   
   @avantgardnerio  perhaps @yahoNanJing  is suggesting something like https://docs.rs/tokio/1.21.2/tokio/runtime/struct.Runtime.html#method.block_on which blocks a thread and waits for an async method to resolve
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3955: Change back to non async interface for try_into_logical_plan

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3955:
URL: https://github.com/apache/arrow-datafusion/pull/3955#issuecomment-1292184021

   > the interface of creating the logical plan should be serialize API
   
   I have the unfortunate tendency to agree with @yahoNanJing here... see fun examples like the log4j CVE and Java's famous [URL.equals()](https://news.ycombinator.com/item?id=21765788) examples of abuse of network access in parsing logic. That's why I originally spent so much time working on the separate `async fn create()` and `fn with_schema()` functions. 
   
   It was very convenient to switch this to async, but from a principled perspective I think it's the wrong move. I can work @houqp and the folks over at `delta-rs` to serde the whole `DeltaTable` and `ObjectStore`.
   
   Since that PR is already working though, it would be nice to have some time to verify the assumption that the issue is resolvable before merging this PR.
   
   Thoughts @andygrove and @alamb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org