You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 02:50:09 UTC

[GitHub] [arrow-datafusion] yahoNanJing opened a new pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

yahoNanJing opened a new pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #2072.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r838779683



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       This change seems a little suspicious to me (as the data access crate should be picked up locally)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] thinkharderdev commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839383936



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       Isn't `datafusion-data-access` a core part of datfusion though. The `ObjectStore` trait is baked in everywhere. It seems a bit weird to move that into a contrib module. 
   
   Would it be possible to leverage something like https://github.com/apache/arrow-datafusion/pull/1881 to load contrib `ObjectStores` as plugins?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#issuecomment-1085008983


   > Would it be possible to leverage something like https://github.com/apache/arrow-datafusion/pull/1881 to load contrib ObjectStores as plugins?
   
   That certainly sounds like a good idea to me
   
   I almost wonder if it is time to think about "ballista the codebase" and "ballista the packaged system" as different things -- for example, maybe we could contemplate a script / dockerfile that created ballista packages with whatever optional features were desired and pulled the code from its various sources 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839208250



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       Yes. Since the data access crate is relatively stable, I'm not sure whether it's necessary to release it together with datafusion. How about splitting it to another repository, for example to the https://github.com/datafusion-contrib? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#issuecomment-1086551296


   > Isn't datafusion-data-access a core part of datfusion though. The ObjectStore trait is baked in everywhere. It seems a bit weird to move that into a contrib module.
   
   Yeah, if we move it out, it would be treated similar to how we use and maintain the sql parser and arrow-rs crate.
   
   Being able to load custom object stores as plugins at runtime seems like a good feature to have in the long run so folks can load proprietary object store plugins as well. 
   
   I also think these two routes are not conflicting with each other so perhaps we can do both?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839193390



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       There is no circular dependency in the cargo dependency graph, but there is a circular dependency on how these crates gets released. The data access crate needs to be released together with datafusion right now, which is what's causing the problem.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839211034



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       yeah, i am leaning towards moving the data access crate into contrib as well. this can further reduce datafusion release overhead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839211034



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       yeah, i am leaning towards moving the data access crate into contrib as well




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on a change in pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on a change in pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#discussion_r839110416



##########
File path: datafusion/core/Cargo.toml
##########
@@ -60,9 +62,10 @@ async-trait = "0.1.41"
 avro-rs = { version = "0.13", features = ["snappy"], optional = true }
 chrono = { version = "0.4", default-features = false }
 datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
-datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
+datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }

Review comment:
       Hi @alamb, with the datafusion-data-access module spitted out, now the dependencies are as follows:
   <pre>
                         ballista
                              ⬇️ 
                       datafusion   ➡️   objectstore-hdfs
                              ⬇️             ⬇️ 
                          datafusion-data-access
   </pre>
   
   Therefore, there's no cyclic dependency. Maybe it's better to split the datafusion-data-access into a separate repository and make a first version of release. Then both the datafusion and objectstore-hdfs can depend on the specified version of datafusion-data-access rather than leveraging the rev way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#issuecomment-1081382106


   With the hdfs feature enabled, there's a testing case in https://github.com/datafusion-contrib/datafusion-objectstore-hdfs/commit/17b68ff531e4067d0122de2fe46d174b8203f288 for verification


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yahoNanJing commented on pull request #2111: Introduce datafusion-objectstore-hdfs as optional features in the datafusion core

Posted by GitBox <gi...@apache.org>.
yahoNanJing commented on pull request #2111:
URL: https://github.com/apache/arrow-datafusion/pull/2111#issuecomment-1081383924


   Hi @alamb, @matthewmturner, @yjshen, @houqp, do you think whether it's possible to introduce the object store extensions as optional features like this way?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org