You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/16 00:38:21 UTC

[GitHub] [arrow-datafusion] wjones127 opened a new pull request, #2246: Sketch out write operations

wjones127 opened a new pull request, #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246

   # Which issue does this PR close?
   
   Closes #2185.
   
    # Rationale for this change
   
   This PR will add 
   
   # What changes are included in this PR?
   
    * [x] Add writer to `ObjectStore` trait
    * [ ] Implement writer for local filesystem
    * [ ] Add delete functions to `ObjectStore` trait
    * [ ] Implement delete functions for local filesystem
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on a diff in pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#discussion_r853622692


##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(

Review Comment:
   That's a good point. I didn't know that was the behavior in C++ Arrow. I started a sibling PR to implement these methods for S3 in https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/54, so I might prototype some more there and see if that's necessary in any way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#issuecomment-1100858190

   cc @yjshen  @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] matthewmturner commented on pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#issuecomment-1100493921

   This would also close #1777 task 1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#issuecomment-1100961851

   I've written the API to take `&str` for paths, but maybe the better thing to to do is to take `AsRef<Path>`? I think that would mean you couldn't pass in paths that include the scheme (e.g. `file://path/to/thing`), but is that already the expectation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#discussion_r852853304


##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(
+            &self,
+            _path: &str,
+            _recursive: bool,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_all(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_contents(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_file(&self, _path: &str) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn rename(

Review Comment:
   I'm not aware of any object stores that support native rename, I think this will need to be implemented as a copy and remove. Perhaps we could provide this as the default implementation?



##########
data-access/src/object_store/mod.rs:
##########
@@ -101,4 +110,57 @@ pub trait ObjectStore: Sync + Send + Debug {
 
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
+

Review Comment:
   I think it would be good to define somewhere the atomicity requirements of implementations. For example, S3 does not provide create if not exists APIs, and therefore cannot guarantee atomic writes.
   
   Similarly if rename, copy, etc... is called concurrently with a writer, the behaviour would likely be implementation defined. Renaming a local file would rename the writer, deleting a local file will vary if Windows or POSIX, etc... 
   
   I think it is fine to just say this is not supported, but if people use this trait and expect it to behave like a filesystem, they're likely to be surprised :sweat_smile: 



##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(

Review Comment:
   I was curious what this corresponded to given that object stores don't have a concept of a directory. It would appear that at least in the case of the S3 implementation, it creates empty objects as pretend directories - https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc
   
   I'm not sure how I feel about this



##########
data-access/src/object_store/local.rs:
##########
@@ -185,7 +297,30 @@ pub fn local_unpartitioned_file(file: String) -> FileMeta {
             size: metadata.len(),
             path: file,
         },
-        last_modified: metadata.modified().map(chrono::DateTime::from).ok(),
+        last_modified: metadata.modified().map(DateTime::<Utc>::from).ok(),
+    }
+}
+
+struct LocalFileWriter {
+    path: String,
+}
+
+impl LocalFileWriter {
+    fn new(path: String) -> Result<Self> {
+        Ok(Self { path })
+    }
+}
+
+#[async_trait]
+impl ObjectWriter for LocalFileWriter {
+    async fn writer(&self) -> Result<Pin<Box<dyn AsyncWrite>>> {
+        let file = AsyncFile::open(&self.path).await?;
+        Ok(Box::pin(file))
+    }
+
+    fn sync_writer(&self) -> Result<Box<dyn Write + Send + Sync>> {

Review Comment:
   FYI currently writing parquet requires Seek - but there is a suggestion here on how that could be removed https://github.com/apache/arrow-rs/issues/937



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on a diff in pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#discussion_r853632070


##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(
+            &self,
+            _path: &str,
+            _recursive: bool,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_all(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_contents(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_file(&self, _path: &str) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn rename(

Review Comment:
   AFAIK, in Rust you can't override implementations from a trait, so I'm reluctant to provide a default implementation. But agreed that as far as I can see that does seem to be the pattern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#discussion_r854030100


##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(
+            &self,
+            _path: &str,
+            _recursive: bool,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_all(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_dir_contents(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn remove_file(&self, _path: &str) -> datafusion_data_access::Result<()> {
+            unimplemented!();
+        }
+
+        async fn rename(

Review Comment:
   https://doc.rust-lang.org/book/ch10-02-traits.html#default-implementations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#issuecomment-1146120787

   Closing in favor of work on https://github.com/influxdata/object_store_rs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 closed pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 closed pull request #2246: Add write and delete operations to ObjectStore
URL: https://github.com/apache/arrow-datafusion/pull/2246


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wjones127 commented on a diff in pull request #2246: Add write and delete operations to ObjectStore

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #2246:
URL: https://github.com/apache/arrow-datafusion/pull/2246#discussion_r859314973


##########
ballista/rust/core/src/serde/logical_plan/mod.rs:
##########
@@ -1016,6 +1016,55 @@ mod roundtrip_tests {
                 "this is only a test object store".to_string(),
             ))
         }
+
+        fn file_writer(
+            &self,
+            _path: &str,
+        ) -> datafusion_data_access::Result<Arc<dyn ObjectWriter>> {
+            unimplemented!();
+        }
+
+        async fn create_dir(

Review Comment:
   It looks like fsspec implementation [doesn't create empty objects as directories](https://github.com/fsspec/s3fs/blob/c4fb41f7cc2f2aede6bbb7755096c38b9e4cc553/s3fs/core.py#L730-L761), so I think it is sensible to do the same. 
   
   It's not totally a no-op though for cases like S3. You might want to have `create_dir` create a bucket if it doesn't exist (though I'd like for there to be an option to turn off bucket creation). And it should probably check that there isn't an existing file at the path you are trying to create a directory at.
   
   I think what I will do is document the expected behavior and create a generic test that can be run on an arbitrary filesystem to help implementation make sure they follow expected behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org